This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!
Common SNPs obscured by identical VariantContext from HaplotypeCaller
Hi GATK team
During one of our recent analyses we found many cases where common SNPs are reported in a very unexpected way by GATK4 Haplotypecaller and this makes our downstream analysis unnecessary hard as we usually assume that SNPs are actually encoded as a SNP. This looks very much like a bug to me, but I am interested in finding out whether there is actually a good reason in encoding the variants this way. In a cohort of 5000 patients, we saw that almost 2.5% of common SNPS where affected (200k out of 8M variants) and for larger cohorts the percetage goes up to 5%.
Here are some examples of called variants from a 160 patient cohort (sites-only, to make it small) where a SNP is obscured by some variant context that is identical between REF and all ALTs.
chr1 821054 . GTCTATACTACCTGCCTGTCGAGCAGAT CTCTATACTACCTGCCTGTCGAGCAGAT,ATCTATACTACCTGCCTGTCGAGCAGAT,* 4897594.93
chr1 876041 rs61768170 TACTCCCCCAC AACTCCCCCAC,* 319.96
chr1 93326098 rs71730518 CT CTT,CTTT,* 38399.11
chr1 112407840 rs10589164 CT CTTTT,CTT,CTTTTT,* 16295.78
chr1 158902694 rs5778098 CTT CT,* 118255.17
All those variants can be trimmed by removing the ends of REF and ALT sequences. Some end up as simple SNPs, while others remain albeit shorter indels:
chr1 821054 . G C,A,* 4897594.93 PASS
chr1 876041 rs61768170 T A,* 319.96
chr1 93326098 rs71730518 C CT,CTT,* 38399.11
chr1 112407840 rs10589164 C CTTT,CT,CTTTT,* 16295.78
chr1 158902694 rs5778098 CT C,* 118255.17
I noticed that all affected sites have an upstream deletion (*) so maybe this notation is trying to tell us something that we just don't understand?
Now assuming that there is a good reason why these variants are represented this way, how can I best trim the variants to the actually varying part? I tried
bcftools norm and GATKs LeftAlignAndTrimVariants, but both did not change any of these sites. Only
LeftAlignAndTrimVariants --split worked, however it also splits the multiallelic sites into multiple lines, which comes with its own limitations.
In case it helps here are some more details: All samples are human, normal (non-tumor) samples sequenced to 30x. Individual gvcf were produced by HaplotypeCaller in gvcf mode using GATK 3.5. bams came from a functional equivalence pipeline. Joint genotypes were called using a published GATK4 wdl: https://github.com/gatk-workflows/gatk4-germline-snps-indels.. I also tested using a GATK 3.5 Joint calling workflow and it shows the exact same behavior, so it does not seem to be caused by genomicsDB or other GATK4 features.
thanks for any insight