Unified Genotyper: Minimum fraction for SNPs

In working with a mapping that has very large coverage in certain places, I have noticed that the Unified Genotyper often calls SNPs even if the alternate allele is present in only a minuscule (<1%) fraction of the reads at a position. It is difficult to filter these SNPs out because QUAL is high for them, and QD, which is low in these situations, is also low for many other (good) SNPs.
There already is a fractional threshold option for indel calling, -minIndelFrac. Is there any similar option for SNPs – and if not, what is your reasoning for omitting this option and what steps do you recommend for filtering out these SNPs?
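(For reference, the kind of hard filtering being asked about would look roughly like the following, assuming GATK 3-era command-line syntax; the file names and the QD cutoff are placeholders for illustration only:)

    # Flag (rather than remove) variants whose QD falls below an illustrative cutoff
    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta -V raw_calls.vcf \
        --filterExpression "QD < 2.0" --filterName "LowQD" \
        -o filtered_calls.vcf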
Best Answer
droazen (Cambridge, MA)
The GATK's approach to downsampling is based on even representation of reads from every alignment start position, rather than a truly random sampling. Is it possible that the reads with the mutations all share the same (or a small number of) alignment start positions?
Answers
No, there is not a similar option for SNPs; the calling model should be handling these correctly. Have you looked at the mapping qualities of the reads in those regions? Very high coverage (especially compared to the rest of the genome) can be indicative of poor mapping quality. If that's the case, the GATK will largely ignore the poorly mapped reads, so among the reads actually used for calling, the fraction supporting the calls may be better than it looks at first glance.
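(One quick way to look at the mapping qualities in such a region, sketched here with samtools and placeholder file/region names, is to tabulate the MAPQ column of the alignments:)

    # MAPQ is column 5 of the SAM record; this prints a count of reads per MAPQ value in the region
    samtools view sample.bam chr1:100000-100200 | cut -f5 | sort -n | uniq -c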
We are only targeting fairly small regions of the genome, and to ensure every base is covered we are intentionally aiming for a high average coverage (>1000x). Perhaps that is a bit excessive, but it is outside of my control. The read quality is generally very good, but the errant SNPs called in these cases usually come from a tiny fraction of reads at that base (1%), which often map badly and which I am told are probably sequencing artefacts.
Ah I see. We don't work with small target sequencing much so we don't have specific recommendations for dealing with that.
It's odd that the poorly mapped reads would be causing artifactual SNPs to be called. Normally they should be getting ignored or at least not counted for much. If any SNPs are called from them, the confidence scores should be pretty low, and the allele depth (AD) should also reflect their minority nature. Could you post a few example lines from the VCF?
These have a very low quality by depth; the supporting reads make up around 0.25% of the reads at each position:
Somewhat better quality by depth:
For comparison, here is a bona fide SNP with an almost precisely 50/50 division between reference and alternate allele over a depth >>250. The QD is somewhat higher than in the previous examples, but it is still very low. The overall read quality at this position is similar to the other examples. SNPs like these make it difficult for me to use QD as a threshold:
I'm not sure I understand where you get your 0.25% figure from. From the DP=250 field, it looks like downsampling is set at the default. Do you mean to say that you have 100,000x coverage in these regions? If you want to use more of the reads, you need to relax the downsampling.
Based on the DP and AD annotations, those SNPs are supported by about 25% of the reads used for calling. Not ideal if your organism is diploid, but certainly not "minuscule", and more than you'd expect by chance due to sequencing error. Have you looked at the pileups?
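(A pileup at one of the suspect positions can be produced with samtools; the reference, BAM, and coordinates below are placeholders:)

    # Show the per-read bases and base qualities at a single position
    samtools mpileup -f reference.fasta -r chr1:123456-123456 sample.bam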
Yes, in these cases the coverage is in the 100,000s. I do get similar problems with coverage in the 1000s though.
You can try calling these again with a more relaxed downsampling setting (e.g. -dcov 1000). If you're calling from only a small subset of the genome, the extra runtime should be manageable.
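(A sketch of such a call, assuming GATK 3-era UnifiedGenotyper syntax with placeholder file names:)

    # Re-call the targeted intervals, keeping up to 1000 reads per position after downsampling
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R reference.fasta -I sample.bam \
        -L targets.intervals \
        -dcov 1000 \
        -o ug_calls.vcf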
By the way, what version are you using? We recently upgraded the downsampler because the previous implementation suffered from some biases. Can't say that's what's causing your issue but it is a possibility.
I have upgraded to the latest version of GATK and used -dcov 1000. The UnifiedGenotyper still generates fairly similar SNPs. It generates fewer bad SNPs than before, but still at a rate far in excess of what I would expect from simple bad luck with the random sampling. Some SNPs from one of the regions called earlier where the alternate allele has less than 1% of reads:
The HaplotypeCaller performs as I would expect on both the good and bad example SNPs from earlier, so I will consider using it for these kinds of samples in the future.
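(For comparison, the equivalent HaplotypeCaller invocation would be along these lines, again with placeholder names and GATK 3-era syntax:)

    # Local-assembly-based calling over the same intervals
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta -I sample.bam \
        -L targets.intervals \
        -o hc_calls.vcf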
I'm sorry if I'm being thick-headed here, but I don't understand what you think is so wrong with these calls. Looking at the first line, at a DP of 1000 (due to downsampling), you have a 0/1 het genotype where the AD shows the REF allele is supported by 728 reads vs. 249 reads for the ALT allele. Again, that's about 25% support for the variant call, which is not huge but also a far cry from the less than 1% you're quoting. I just don't see where you get that number from.
What numbers are you getting for the same sites with HaplotypeCaller?
Ultimately it's great that you like the HaplotypeCaller better, since the plan is that it will eventually replace the UG entirely...
These SNPs are over a very small region (an exon) that has 100,000-200,000x coverage. Almost none of the reads have any mutations, but for some reason the downsampling picks a vastly disproportionate fraction of reads that do have a base mutation. So in one case the UG may downsample from ~100,000 reads to 1,000 and end up with 728 REF vs. 249 ALT.
Have you checked the mapping qualities of the large numbers of reads that don't get picked? It's possible that they're getting filtered out prior to downsampling due to low mapping quality or other issues. Or they may be marked as duplicates, which would also rule them out prior to downsampling.
Do you get calls for these sites with the HaplotypeCaller?
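(Two quick checks for the mapping-quality and duplicate-flag questions above, using samtools with placeholder names; the MAPQ cutoff is illustrative:)

    # Count reads in the region with mapping quality of at least 20
    samtools view -c -q 20 sample.bam chr1:100000-100200
    # Count reads in the region flagged as duplicates (SAM flag 0x400)
    samtools view -c -f 1024 sample.bam chr1:100000-100200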
I re-ran the UG earlier today on just that region and the run statistics listed no reads filtered at all. I had suspected that the duplicate read filter might have caused it, but that does not seem to be the case. I do not get any calls there with the HaplotypeCaller.
The GATK's approach to downsampling is based on even representation of reads from every alignment start position, rather than a truly random sampling. Is it possible that the reads with the mutations all share the same (or a small number of) alignment start positions?
That explains the problem. Many of the good reads in this sample start at similar points, whereas many of the bad reads are trimmed to different lengths and start at arbitrary positions. I will take that into account from now on.
That would do it -- if you'd prefer a truly random sampling to the default positional (alignment-start-based) downsampling, you can try experimenting with -dfrac rather than -dcov.
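(A sketch of that alternative, again assuming GATK 3-era syntax and placeholder names; the 0.01 fraction is illustrative:)

    # Keep a random fraction of the reads (here 1%) instead of capping coverage per position
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R reference.fasta -I sample.bam \
        -L targets.intervals \
        -dfrac 0.01 \
        -o ug_dfrac_calls.vcf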