VQSLOD

We are doing WES on hundreds of samples and the average sequencing depth is 60X. I used the sensitivity 99.9% for the PASS filter in VQSR. Attached is the histagram of VQSLOD for around 900,000 SNPs with PASS filter and call rate > 95%. I wonder if you can help me with two questions?
1. is it normal to have half of the SNPs with VQSLOD < 0?
2. why so few SNPs around VQSLOD = 0?
Answers
A 2nd question, I'm trying to use the following command to extract SNPs of VQSLOD > 0:
but got this error:
because the 1st SNP has "
VQSLOD=-3.415e+00
"@blueskypy I can help your with your 2nd/3rd question. You are meant to apply the variant recalibration/filtering with ApplyRecalibration, if you follow the default practices. VQSR is a two (four with indels) step process, if you follow the best practices. Here is the VQSR FAQ:
https://www.broadinstitute.org/gatk/guide/article?id=1259
I'm just an end user like you. Someone smart will probably illuminate us on your questions 1a and 1b
I had never noticed the bimodal distribution myself...
Oh, I forgot to mention. ApplyRecalibration takes TruthSensitivity Ratios as input
--ts_filter_level
, so you need to look in your tranches files, which threshold is equivalent to aVQSLOD
of 0. Does that make sense?I have never noticed this before, but there seems to be an
--lodCutoff
option available for AR. I wouldn't recommend filtering based onVQSLOD
values, but I could be wrong...Hi @blueskypy and @tommycarstensen
I believe the bimodal distribution of the VQSLOD is expected because it results from taking the ratio of the negative and positive model values for the site. I'm not sure if the split is expected to be 50/50 in all cases, but I wouldn't be surprised if it was -- I'll ask the mathketeers.
We don't encourage filtering directly on VQSLOD because it takes away your ability to derive the tradeoff between sensitivity and specificity from how the modeling performed on a gold standard (eg hapmap).
@blueskypy I'm not convinced your distributions from which the VQSLOD values are calculated is normal then. I tried plotting a histogram of the log(p(x)/q(x)) of various distributions p(x) and q(x) and I didn't see anything bimodal. @Geraldine_VdAuwera I'm actually interested in hearing what the mathketeers have to say, if you still have any of them working there.
P.S. Geraldine, I think I have discovered how to avoid a forum bug, which has annoyed me for a long time.
I can see the figure posted by @blueskypy if I follow this link to the thread:
http://gatkforums.broadinstitute.org/discussion/5869/vqslod
I can't see it (but I can see yours), if I follow this link to the most recent comment in the thread:
http://gatkforums.broadinstitute.org/discussion/comment/22846
I think this is the reason you and @Sheila often report that you can't see my attachments. The forum is cool, but no flawless victory here...
Here is the hard link to your plots in case they don't appear for anyone else reading this thread:
https://us.v-cdn.net/5019796/uploads/FileUpload/e1/ceee6c07f7b4ea2a4b0567bba66ba7.png
https://us.v-cdn.net/5019796/uploads/FileUpload/46/3a504c5d8166808b56ad53ca1384d1.png
HI, @Geraldine_VdAuwera
Thanks for the help! I found 90% of the SNP has MAF < 0.004, most of which has small VQSLOD. I believe the overall VQSLOD should improve once those SNPs are removed.
hi, @tommycarstensen
Thanks for the suggestion! I think the distribution may depends on the VariantRecalibrator -tranche setting. Seems the default 99.9% is too loose, and results in many false positive. I suppose SNPs with very small MAF are more likely to be false positive than those of larger MAF.
@blueskypy Thanks for the reply. The VQSLOD distribution is not affected by the
-tranche
setting. It's just a "granularity" setting. The current best practice recommendation for SNPs is 99.5%, if I remember correctly. Yes, as a rule of thumb rare SNPs are more frequently false positives, but I wouldn't filter based on it. It all depends what you need the data for of course. Have fun.@tommycarstensen
You are right! the tranche won't change the VQSLOD; I meant the filtering based on the tranche changes the distribution.
Of course, I won't filter based on MAF, but rather on MAC.
Here's the response from one of our mathketeers:
thanks @Geraldine_VdAuwera