Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Unified Genotyper: Minimum fraction for SNPs

lyschoeninglyschoening Posts: 12Member

In working with a mapping that has a very large coverage in certain places, I have noticed that the Unified Genotyper often calls SNPs even if the alternate allele is only present in a minuscule <1% fraction of the reads at a position. It is difficult to filter these SNPs out because QUAL is high for these SNPs the and QD, which is low in these situations, is also low for many other (good) SNPs.

There already is a fractional threshold option for indel calling, -minIndelFrac. Is there any similar option for SNPs – and if not, what is your reasoning for omitting this option and what steps do you recommend for filtering out these SNPs?

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,989Administrator, GATK Developer admin

    No, there is not a similar option for SNPs, the calling model should be handling these correctly. Have you looked at the mapping qualities of the reads in those regions? Very high coverage (especially compared to the rest of the genome) can be indicative of poor mapping quality. If that's the case, the GATK will largely ignore the reads that are poorly mapped, so it may be that among the reads actually used for calling, the fraction supporting the calls is better than it looks at first glance.

    Geraldine Van der Auwera, PhD

  • lyschoeninglyschoening Posts: 12Member

    We are only targeting fairly small regions of the genome, and to ensure every base is covered we are intentionally aiming for a high average coverage (>1000x). Perhaps that is a bit excessive, but it is outside of my control. The read quality is generally very good, but usually the errant SNPs that get called in these cases come from a tiny fraction of reads at their base (1%), which often map badly and which I am told probably are sequencing artefacts.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,989Administrator, GATK Developer admin

    Ah I see. We don't work with small target sequencing much so we don't have specific recommendations for dealing with that.

    It's odd that the poorly mapped reads would be causing artifactual SNPs to be called. Normally they should be getting ignored or at least not counted for much. If any SNPs are called from them, the confidence scores should be pretty low, and the allele depth (AD) should also reflect their minority nature. Could you post a few example lines from the VCF?

    Geraldine Van der Auwera, PhD

  • lyschoeninglyschoening Posts: 12Member
    edited June 2013

    These here have a very low quality by depth. The reads here make up around 0.25% of the reads at each position:

    chr7    140481386   .   T   C   71.77   .   AC=1;AF=0.500;AN=2;BaseQRankSum=-7.867;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=633.7063;MLEAC=1;MLEAF=0.500;MQ=23.79;MQ0=0;MQRankSum=0.101;QD=0.29;ReadPosRankSum=1.371    GT:AD:DP:GQ:PL  0/1:157,80:236:99:100,0,1323
    chr7    140481401   .   T   C   39.77   .   AC=1;AF=0.500;AN=2;BaseQRankSum=-6.077;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=827.7223;MLEAC=1;MLEAF=0.500;MQ=25.56;MQ0=0;MQRankSum=-0.250;QD=0.16;ReadPosRankSum=1.952   GT:AD:DP:GQ:PL  0/1:159,63:236:68:68,0,1003
    chr7    140481442   .   G   C   137.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-5.113;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=1005.6086;MLEAC=1;MLEAF=0.500;MQ=27.81;MQ0=0;MQRankSum=2.597;QD=0.55;ReadPosRankSum=1.659   GT:AD:DP:GQ:PL  0/1:135,67:236:99:166,0,716
    chr7    140481451   .   G   A   161.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-3.717;DP=250;Dels=0.00;FS=4.752;HaplotypeScore=924.8052;MLEAC=1;MLEAF=0.500;MQ=28.50;MQ0=0;MQRankSum=1.976;QD=0.65;ReadPosRankSum=-0.387   GT:AD:DP:GQ:PL  0/1:130,75:236:99:190,0,1149
    chr7    140481459   .   C   A   290.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.825;DP=250;Dels=0.00;FS=2.753;HaplotypeScore=1008.8916;MLEAC=1;MLEAF=0.500;MQ=28.91;MQ0=0;MQRankSum=0.784;QD=1.16;ReadPosRankSum=2.189   GT:AD:DP:GQ:PL  0/1:168,70:236:99:319,0,1189
    

    Somewhat better quality by depth:

    chr17   7578240 .   C   T   547.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.119;DP=224;Dels=0.00;FS=41.712;HaplotypeScore=340.9826;MLEAC=1;MLEAF=0.500;MQ=38.77;MQ0=0;MQRankSum=6.252;QD=2.45;ReadPosRankSum=-3.799  GT:AD:DP:GQ:PL  0/1:172,45:209:99:576,0,4046
    

    For comparison a bona-fide SNP with an almost precisely 50/50 division between reference and alternate allele over a depth >>250. The QD is somewhat higher than in the previous examples, but it is still very low. The overall read quality at this position is similar to the other examples. SNPs like these make it difficult for me to use QD as a threshold:

    chr17   7578115 rs1625895   T   C   1134.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-4.481;DB;DP=250;Dels=0.02;FS=9.855;HaplotypeScore=980.0644;MLEAC=1;MLEAF=0.500;MQ=29.72;MQ0=0;MQRankSum=-3.231;QD=4.54;ReadPosRankSum=-2.127   GT:AD:DP:GQ:PL  0/1:86,89:228:99:1163,0,1752
    
    Post edited by lyschoening on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,989Administrator, GATK Developer admin

    The reads here make up around 0.25% of the reads at each position:

    I'm not sure I understand where you get your 0.25 % figure from. From the DP=250 field, I see that the downsampling is set at the default, probably. Do you mean to say that you have 100,000 x coverage in these regions? If you want to use more of the reads you need to relax the downsampling.

    Based on the DP and AD annotations, those snps are supported by about 25 % of the reads used for calling. Not ideal if your organism is diploid, but certainly not "minuscule", and more than you'd expect by chance due to sequencing error. Have you looked at the pileups?

    Geraldine Van der Auwera, PhD

  • lyschoeninglyschoening Posts: 12Member

    Yes, in these cases the coverage is in the 100,000s. I do get similar problems with coverage in the 1000s though.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,989Administrator, GATK Developer admin

    You can try calling these again with a more relaxed downsampling setting (eg -dcov 1000). If you're calling from only a small subset

    By the way, what version are you using? We recently upgraded the downsampler because the previous implementation suffered from some biases. Can't say that's what's causing your issue but it is a possibility.

    Geraldine Van der Auwera, PhD

  • lyschoeninglyschoening Posts: 12Member

    I have upgraded to the latest version of GATK and used -dcov 1000. The UnifiedGenotyper still generates fairly similar SNPs. It generates fewer bad SNPs than before, however still at a rate far in excess of what I would expect from simple bad luck with the random sampling.

    Some SNPs from one of the regions called earlier where the alternate allele has less than 1% of reads:

    chr7    140481459   .   C   A   146.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-13.332;DP=1000;Dels=0.00;FS=123.084;HaplotypeScore=2248.3873;MLEAC=1;MLEAF=0.500;MQ=23.44;MQ0=0;MQRankSum=6.327;QD=0.15;ReadPosRankSum=-8.231  GT:AD:DP:GQ:PL  0/1:728,249:950:99:175,0,9467
    chr7    140481499   .   T   A   1232.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-20.142;DP=1000;Dels=0.01;FS=159.114;HaplotypeScore=2034.4262;MLEAC=1;MLEAF=0.500;MQ=23.72;MQ0=0;MQRankSum=12.060;QD=1.23;ReadPosRankSum=-12.800    GT:AD:DP:GQ:PL  0/1:605,358:940:99:1261,0,8545
    chr7    140481508   .   T   A   931.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-18.568;DP=997;Dels=0.01;FS=137.931;HaplotypeScore=1507.9040;MLEAC=1;MLEAF=0.500;MQ=23.76;MQ0=0;MQRankSum=10.657;QD=0.93;ReadPosRankSum=-2.159  GT:AD:DP:GQ:PL  0/1:605,336:940:99:960,0,7559
    chr7    140481517   .   C   A   342.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-10.911;DP=993;Dels=0.01;FS=154.228;HaplotypeScore=2228.0195;MLEAC=1;MLEAF=0.500;MQ=23.62;MQ0=0;MQRankSum=7.461;QD=0.35;ReadPosRankSum=2.393    GT:AD:DP:GQ:PL  0/1:733,223:936:99:371,0,8940
    chr7    140481533   .   G   A   76.77   .   AC=1;AF=0.500;AN=2;BaseQRankSum=-17.460;DP=990;Dels=0.02;FS=96.720;HaplotypeScore=1373.2502;MLEAC=1;MLEAF=0.500;MQ=23.56;MQ0=0;MQRankSum=8.717;QD=0.08;ReadPosRankSum=6.640 GT:AD:DP:GQ:PL  0/1:689,274:926:99:105,0,9508
    

    The HaplotypeCaller performs as I would expect on both the good and bad example SNPs from earlier, so I will consider using it for these kinds of samples in the future.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,989Administrator, GATK Developer admin

    I'm sorry if I'm being thick-headed here, but I don't understand what you think is so wrong with these calls. Looking at the first line, at a DP of 1000 (due to downsampling), you have a 0/1 het genotype where the AD shows the REF allele is supported by 728 reads vs. 249 reads for the ALT allele. Again, that's about 25% support for the variant call, which is not huge but also a far cry from the less than 1% you're quoting. I just don't see where you get that number from.

    What numbers are you getting for the same sites with HaplotypeCaller?

    Ultimately it's great that you like the HaplotypeCaller better, since the plan is that it will eventually replace the UG entirely...

    Geraldine Van der Auwera, PhD

  • lyschoeninglyschoening Posts: 12Member
    edited June 2013

    These SNPs are over a very small region (an exon) that has 100-200,000x coverage. Almost none of the reads have any mutations, but for some reason the downsampling picks a vastly disproportionate fraction of reads that do have a base mutation. So in one case the UG may downsample 100,000/1000 to 728/249.

    Post edited by lyschoening on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,989Administrator, GATK Developer admin

    Have you checked the mapping qualities of the large numbers of reads that don't get picked? It's possible that they're getting filtered out prior to downsampling due to low mapping quality or other issues. Or they may be marked as duplicates, which would also rule them out prior to downsampling.

    Do you get calls for these sites with the HaplotypeCaller.

    Geraldine Van der Auwera, PhD

  • lyschoeninglyschoening Posts: 12Member

    I re-ran the UG earlier today on just that region and the run statistics listed no reads filtered at all. I had suspected that the duplicate read filter might have caused it, but that does not seem to be the case. I do not get any calls there with the HaplotypeCaller.

  • lyschoeninglyschoening Posts: 12Member

    That explains the problem. Many of the good reads in this sample start at similar points, whereas many of the bad reads are trimmed to different lengths and start at arbitrary positions. I will take that into account from now on.

  • droazendroazen Posts: 51GATK Developer mod

    That would do it -- if you'd prefer a truly random sampling to the default positional/alignment-start-based downsampling, you can try experimenting with using -dfrac rather than -dcov.

Sign In or Register to comment.