Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

[Screenshot + info] SNP visible in IGV but is not called by UnifiedGenotyper

nikmalnikmal Posts: 23Member
edited April 2013 in Ask the GATK team

Greetings!

First of all, thank you for a truly great toolkit! It is no doubt the best one out there.

Now, I have a question regarding visualization of a SNP that is not called by UG but looks convincing in IGV. Yes, I've looked at the FAQ page gatkforums.broadinstitute.org/discussion/1235/why-didnt-the-unified-genotyper-call-my-snp-i-can-see-it-right-there-in-igv but I'm still not completely convinced that this is a false positive.

The BAM files have gone through the Best Practices workflow prior to SNP calling. Calling was done using UG with subsequent recalibration steps, where I followed the guidelines under gatkforums.broadinstitute.org/discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project. SNP calling was done using GATK 2.4-9.

Below is a screenshot from IGV showing the SNP call:

Fullsize here: s24.postimg.org/sepow851v/igv_snp.png

The average mapping quality for the reads that include the SNP is 50 and the average base quality at the locus of the SNP is 28.7 (not including 4 positions where base quality is below 10). These values are calculated from the values shown by IGV

Are these values really too low to not confidently call this SNP? I mean a base quality of 28.7 means a probability of 99.87% that the base call is correct. Isn't that good enough?

Please help me understand this, and let me know if you need more information. :)

Post edited by nikmal on

Best Answer

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,295 admin
    Answer ✓

    Ah, I see. Two things:

    1. The MQ is the quadratic mean of the mapping quals for all reads at the sites; its value shouldn't be interpreted like a classic phred-scaled mapq value. To give you a sense of scale, when using hard filters instead of VQSR, it is common to filter out variants with MQ < 40. So here, if you have a lot of variants with great MQ, it's possible that the model determined that 48 is below acceptable relative to the rest. Having a small call set (as warned) may contribute to that.

    2. Choosing a ts level of 99.0 is actually pretty stringent; it means that you are looking to retrieve "only" 99% of true sites. Again, if you have a lot of high quality variants, that will lead to respectable snps like this one being unfairly filtered out.

    So this is potentially a sign that your call set quality is pretty good and you can afford to lower your specificity requirement in favor of more sensitivity. I would recommend changing your ts level to 99.9%. That should rescue this snp while still giving you a good quality call set.

    Geraldine Van der Auwera, PhD

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,295Administrator, GATK Developer admin

    Oh, that does look like a nice clean snp. Two questions:

    1. Are you using the latest version of GATK?

    2. What call and emit thresholds did you use for calling?

    Geraldine Van der Auwera, PhD

  • nikmalnikmal Posts: 23Member
    1. Yes, I used 2.4-9 for the SNP calling. The BAM file pre-processing was done using version 2.2-3 (if that makes a difference).

    2. I used the default settings, so 30.0 for both call and emit.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,295Administrator, GATK Developer admin

    Ok, try dropping the emit threshold to 10 and call just that region again. The snp may get called as LowQual, but at least that'll give you the call statistics, which may shed some light on why the UG is not happy with that site.

    Geraldine Van der Auwera, PhD

  • nikmalnikmal Posts: 23Member

    Alright, I re-ran the SNP calling with lowered emit threshold. Even though I ran it on the entire chromosome, I still got a warning message about few variants from VariantRecalibrator (if that makes a difference). Anyway, this is what the SNP looks like in the VCF output after ApplyRecalibration:

    - ------ rs---- T C 3465.20 VQSRTrancheSNP99.00to99.90 AC=4;AF=1.00;AN=4;DB;DP=110;Dels=0.00;FS=0.000;HaplotypeScore=1.0616;MLEAC=4;MLEAF=1.00;MQ=48.92;MQ0=0;QD=31.50;VQSLOD=-5.803e-01;culprit=MQ GT:AD:DP:GQ:PL 1/1:1,48:48:99:1479,135,0 1/1:0,61:61:99:2013,174,0

    So the filter field says "VQSRTrancheSNP99.00to99.90", which means what exactly? The --ts_filter_level option was set to 99.0 so shouldn't this SNP pass the filter? If I understand correctly, this translates to: "The SNP didn't pass the filter because the truth sensitivity level was between 99 and 99.9". Well, I set the threshold to 99.0 so shouldn't this be ok?

    The VCF header has the following information: ##FILTER=<ID=VQSRTrancheSNP99.00to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -11.1528 <= x < 0.132">

    In the INFO field, the culprit is stated as MQ. However, the MQ is at 48.92 which is close to 99.999%, unless I'm mistaken in this case as well.

    What is going on here? :)

Sign In or Register to comment.