We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!


blueskypyblueskypy Member ✭✭
edited July 2015 in Ask the GATK team

We are doing WES on hundreds of samples and the average sequencing depth is 60X. I used the sensitivity 99.9% for the PASS filter in VQSR. Attached is the histagram of VQSLOD for around 900,000 SNPs with PASS filter and call rate > 95%. I wonder if you can help me with two questions?
1. is it normal to have half of the SNPs with VQSLOD < 0?
2. why so few SNPs around VQSLOD = 0?


Issue · Github
by Sheila

Issue Number
Last Updated
Closed By


  • blueskypyblueskypy Member ✭✭

    A 2nd question, I'm trying to use the following command to extract SNPs of VQSLOD > 0:

    java -Xmx10g -jar $gatk -T SelectVariants \
        -R $refGenome \
        -V SNP.PASS.vcf.gz \
        -select "VQSLOD > 0" \
        -o SNP.PASS.VQSLOD.0.vcf.gz

    but got this error:

    Invalid JEXL expression detected for select-0 with message For input string: "-3.415e+00"

    because the 1st SNP has "VQSLOD=-3.415e+00"

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited July 2015

    @blueskypy I can help your with your 2nd/3rd question. You are meant to apply the variant recalibration/filtering with ApplyRecalibration, if you follow the default practices. VQSR is a two (four with indels) step process, if you follow the best practices. Here is the VQSR FAQ:

    I'm just an end user like you. Someone smart will probably illuminate us on your questions 1a and 1b ;) I had never noticed the bimodal distribution myself...

    Oh, I forgot to mention. ApplyRecalibration takes TruthSensitivity Ratios as input --ts_filter_level, so you need to look in your tranches files, which threshold is equivalent to a VQSLOD of 0. Does that make sense?

    I have never noticed this before, but there seems to be an --lodCutoff option available for AR. I wouldn't recommend filtering based on VQSLOD values, but I could be wrong...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @blueskypy and @tommycarstensen

    I believe the bimodal distribution of the VQSLOD is expected because it results from taking the ratio of the negative and positive model values for the site. I'm not sure if the split is expected to be 50/50 in all cases, but I wouldn't be surprised if it was -- I'll ask the mathketeers.

    We don't encourage filtering directly on VQSLOD because it takes away your ability to derive the tradeoff between sensitivity and specificity from how the modeling performed on a gold standard (eg hapmap).

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @blueskypy I'm not convinced your distributions from which the VQSLOD values are calculated is normal then. I tried plotting a histogram of the log(p(x)/q(x)) of various distributions p(x) and q(x) and I didn't see anything bimodal. @Geraldine_VdAuwera I'm actually interested in hearing what the mathketeers have to say, if you still have any of them working there.

    P.S. Geraldine, I think I have discovered how to avoid a forum bug, which has annoyed me for a long time.

    I can see the figure posted by @blueskypy if I follow this link to the thread:

    I can't see it (but I can see yours), if I follow this link to the most recent comment in the thread:

    I think this is the reason you and @Sheila often report that you can't see my attachments. The forum is cool, but no flawless victory here...

    Here is the hard link to your plots in case they don't appear for anyone else reading this thread:



  • blueskypyblueskypy Member ✭✭

    HI, @Geraldine_VdAuwera
    Thanks for the help! I found 90% of the SNP has MAF < 0.004, most of which has small VQSLOD. I believe the overall VQSLOD should improve once those SNPs are removed.

  • blueskypyblueskypy Member ✭✭

    hi, @tommycarstensen
    Thanks for the suggestion! I think the distribution may depends on the VariantRecalibrator -tranche setting. Seems the default 99.9% is too loose, and results in many false positive. I suppose SNPs with very small MAF are more likely to be false positive than those of larger MAF.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @blueskypy Thanks for the reply. The VQSLOD distribution is not affected by the -tranche setting. It's just a "granularity" setting. The current best practice recommendation for SNPs is 99.5%, if I remember correctly. Yes, as a rule of thumb rare SNPs are more frequently false positives, but I wouldn't filter based on it. It all depends what you need the data for of course. Have fun.

  • blueskypyblueskypy Member ✭✭

    You are right! the tranche won't change the VQSLOD; I meant the filtering based on the tranche changes the distribution.

    Of course, I won't filter based on MAF, but rather on MAC.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Here's the response from one of our mathketeers:

    I am not sure the distribution of vqslods has to be of any form at all. It depends on the concentration of your data points. The p(x) and q(x) are given by the mixture of Gaussian models (positive and negative models), and so is the vqslod. But this is only the model, so the data points might not be (and are often not) truly distributed according to these Gaussians.

    For example suppose you have an asymmetric cluster A for the positive model with side A1 under-represented by the fitted Gaussian and side A2 over-represented. Now suppose you have a similar asymmetric cluster B for the negative model with side B1 under-represented and side B2 over-represented, and suppose A and B are such that A1 overlaps B1. Let's bin the vqslods in small bins so that we can follow "narrow" bands of vqslods in the annotation space. As the vqslods varies from -infinity to +infinity, the bands will move from beyond B2 to the B2 region, to B1, to A1, to A2 and beyond A2. In the B2 region there are a lots of data points so your histogram will show a peak, in the depleted region B1-A1 there are fewer data points so you will see a trough, and finally in the A2 region, which is also over-crowded, you will see another peak. So you will get bimodal distribution. Nothing too wrong with that.

Sign In or Register to comment.