Does VQSR filter out known variants with low score?
Does VQSR filter out known variants with low score or preserve them?
Best Answer

Geraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
Hi Alexey,
We realize that VQSR is a very complex method and that our written documentation does not address all questions, for sure. We're working on remediating that. Quick question though: have you watched the videos from the last workshop? We went into more detail of how resources are used than previously. The videos are available here:
http://www.broadinstitute.org/partnerships/education/broade/bestpracticesvariantcallinggatk1
And the VQSR video in particular is at http://www.broadinstitute.org/node/6700
A few quick answers to your specific questions:
Re: known/unknown, it is indeed whether a variant is novel (= has never reported before). The only way that feature is actually used directly, is for building the tranche plots, which uses ti/tv ratio of novel variants to estimate the rate of false positives vs true positives. But it is not used as such in any of the calculations that produce the VQSLOD.
The priors are used for building the training model; they provide an estimate of how likely the training variants are to be true vs false.
We do not trust dbsnp at all. There are just too many false positives to use dbsnp as a training resource.
Answers
VQSR is a two step process consisting of VariantRecalibrator (VR) and ApplyRecalibration (AR). VR assigns a score to each variant. AR weeds out variants with low scores based on a truth sensitivity threshold chosen by you. A good place to start is here.
Dear tommycarstensen
Thank you for the quick reply.
However, I have not asked about a place to start. I have asked a specific question.
Could you help with my question: how exactly VQSR deals with low quality known variants?
Many thanks again, Alexey
Alexey, let me ask you a specific question. Have you read the VQSR documentation, including the one Tommy pointed you to?
We ask you to start by reading the documentation because most of the time, you will find it answers your question. We write those documents so that we won't have to answer the same questions 25 thousand times (which is the current number of registered users of GATK).
If you find your question is not fully addressed in the docs, let us know what part you think is incomplete, and we'll try to fill in the gaps.
Issue · Github
by Sheila
@alexey_larionov
Hi,
Although we input a few different "known" resource files, we cannot assume that all known variants are necessarily true, because we know that some variants that have been reported previously were in fact false positives. The known resource files have different uses, which are described below.
VQSR uses the training set (the resources you set as training=true) to build a model based on annotations. Based on that model, it assigns a VQSLOD score to each of the sites in your VCF. The tranche threshold you specify determines what percentage of true sites (from the resources you set as truth=true) you wish to consider as true positives.
For example, let's say you set the tranche threshold to 95. VQSR will find the VQSLOD that 95% of your truth set variants have a score greater than or equal to. The variants in your VCF that have that VQSLOD or higher will PASS, and the variants in your VCF that have a lower VQSLOD will be filtered.
So, the only way to keep all of the truth set variants is to set the tranche threshold to 100. Also, note that dbSNP is not used in the training set or the truth set. dbSNP is used to estimate known vs. novel rates.
I hope this helps.
Sheila
Dear tommycarstensen, Geraldine_VdAuwera and Sheila ,
Thank you very much again for your replies. Please be assured that I carefully read the documentation before asking questions, and that I appreciate your great effort to support users like me. I agree that given the complexity of calculations, even after reading documentation, I may miss some basics. Therefore, referring to fundamentals always has its merit...
Now back to specific questions
Sheila’s reply confirmed my feeling that pass/fail filter is set solely on the basis of VQSLOD, ignoring the known/novel status. However, I was not sure. The doubt was caused by the fact that the tranches plot shows novel variants only. Potentially this is consistent with a way, when knowns are used to build the model, then the model is applied to new variants only, while the knowns are included anyway. Such approach might make sense too. However, Sheila’s reply confirms that GATK uses an alternative one: it accounts for the novelty when calculating VQSLOD, not when it applies the filters.
With this regards I wonder whether I could ask for more explanations. In particular, I do not fully understand the meaning of “known” and “prior” parameters in VQSR resources.
My understanding is that VQSLOD is derived from the likelihood denoted L( vi  GMM) in your Nature Genetics 2011 paper (DePristo et al, PMID 21478889, online methods supplement). This likelihood consists of the prior term and the model term. The model term reflects the distance from the cluster(s) centre(s). This term is very well explained in your documentation, while the prior term is given much less focus. The paper says that prior term includes a novelty component, defined as 97% if variant is in HapMap3 and 37% otherwise. This is where I am losing the track:
VQSR accepts more resources than HapMap. Are other resources included in calculation of the prior at this stage?
If no: how their “prior” parameters are used?
If yes: are all resources included into prior calculation automatically, or is the inclusion controlled by parameters “known” and “prior”?
I guess that there is no relation between parameter “known” and inclusion of the resource into calculation at this stage. However, semantically “known/unknown” sounds somehow related to “novelty”. Could you explicitly confirm whether this my guess is correct or not?
If the guess is correct, then what exactly the “known” parameter is used for? My understanding is that it is used to select resource(s) for calculating expected TiTv. Is this correct? Any other uses of the “known” parameter?
Finally, the Nature Genetics 2011 paper sets prior for unknown variant as 37%. Apparently it is equal to the prior for dbsnp ( https://www.broadinstitute.org/gatk/guide/article?id=1259 ). Does this mean that you do not trust to dbsnp entirely?
Sorry for so many questions, I hope that they make sense.
And I thank you very much again for your help and patience.
Hi Alexey,
We realize that VQSR is a very complex method and that our written documentation does not address all questions, for sure. We're working on remediating that. Quick question though: have you watched the videos from the last workshop? We went into more detail of how resources are used than previously. The videos are available here:
http://www.broadinstitute.org/partnerships/education/broade/bestpracticesvariantcallinggatk1
And the VQSR video in particular is at http://www.broadinstitute.org/node/6700
A few quick answers to your specific questions:
Re: known/unknown, it is indeed whether a variant is novel (= has never reported before). The only way that feature is actually used directly, is for building the tranche plots, which uses ti/tv ratio of novel variants to estimate the rate of false positives vs true positives. But it is not used as such in any of the calculations that produce the VQSLOD.
The priors are used for building the training model; they provide an estimate of how likely the training variants are to be true vs false.
We do not trust dbsnp at all. There are just too many false positives to use dbsnp as a training resource.
Thanks again for your explanations. The recent videos are as good as usually; they perfectly explain your GMM and the algorithm for fitting.
As a feedback, not as a continuation of my question:
The Nature 2011 paper has a formula that looks like the following
L(vi  GMM) = Pr{ vi } Pr{ Vi  GMM}
(sorry for such way of referencing the formula, but there is no numbering of formulas in the paper)
The second term on the right part of this formula seems to be the model ( Pr { Vi  GMM } ).
And the first term is the prior expectation that depends on (i) alt allele count and on (ii) being in HapMap3.
It seems to me that the step described in this formula is omitted in your presentations.
As well calling the score "log odds" might be misleading for some users. Sometime odds are defined as ratio of mutually exclusive events from the same probability distribution. In VQSR case it seems, that semantically being FP and FN is mutually exclusive, but mathematically you estimate them from different distributions.
I admit that this level of details may have little practical implication and may takes more time than it worth.
Thank you very much again for your effort to help.
@alexey_larionov
Hi,
Thanks for the suggestions. I am happy someone is interested in the math! We always have a hard time deciding how much math to include in the presentations, as many users simply care how the tools work on a high level. I will keep in mind your post next time I want to include more math.
As for the log odds issue, I think this thread between blueskypy and Geraldine will help: http://gatkforums.broadinstitute.org/discussion/comment/21579/#Comment_21579
Sheila