Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Variant Quality Score Recalibration (VQSR)
VQSR stands for Variant Quality Score Recalibration. In a nutshell, it is a sophisticated filtering technique applied on the variant callset that uses machine learning to model the technical profile of variants in a training set and uses that to filter out probable artifacts from the callset.
Note that this variant recalibration process (VQSR) should not be confused with base recalibration (BQSR), which is a data pre-processing method applied to the read data in an earlier step. The developers who named these methods wish to apologize sincerely to anyone, especially Spanish-speaking users, who get tripped up by the similarity of these names.
- Variant recalibration procedure details
- Interpretation of the Gaussian mixture model plots
- Tranches and the tranche plot
- Resource datasets
What's in a name?
Let's get this out of the way first -- “variant quality score recalibration” is kind of a bad name because it’s not re-calibrating variant quality scores at all; it is calculating a new quality score called the VQSLOD (for variant quality score log-odds) that takes into account various properties of the variant context not captured in the QUAL score. The purpose of this new score is to enable variant filtering in a way that allows analysts to balance sensitivity (trying to discover all the real variants) and specificity (trying to limit the false positives that creep in when filters get too lenient) as finely as possible.
The basic, traditional way of filtering variants is to look at various annotations (context statistics) that describe e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation; things like that -- then choose threshold values and throw out any variants that have annotation values above or below the set thresholds. The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.
The VQSR method, in contrast, uses machine learning algorithms to learn from each dataset what is the annotation profile of good variants vs. bad variants, and does so in a way that integrates information from multiple dimensions (like, 5 to 8, typically). The cool thing is that this allows us to pick out clusters of variants in a way that frees us from the traditional binary choice of “is this variant above or below the threshold for this annotation?”
Let’s do a quick mental visualization exercise, in two dimensions because our puny human brains work best at that level. Imagine a topographical map of a mountain range, with North-South and East-West axes standing in for two variant annotation scales. Your job is to define a subset of territory that contains mostly mountain peaks, and as few lowlands as possible. Traditional hard-filtering forces you to set a single longitude cutoff and a single latitude cutoff, resulting in one rectangular quadrant of the map being selected, and all the rest being greyed out. It’s about as subtle as a sledgehammer and forces you to make a lot of compromises. VQSR allows you to select contour lines around the peaks and decide how low or how high you want to go to include or exclude territory.
That sounds great! How does it work?
Well, like many things that mention the words "machine learning", it's a bit complicated. The key point is that we use known, highly validated variant resources (omni, 1000 Genomes, hapmap) to select a subset of variants within our callset that we’re really confident are probably true positives (that’s the training set). We look at the annotation profiles of those variants (in our own data!), and we from that we learn some rules about how to recognize good variants. We do something similar for bad variants as well. Then we apply the rules we learned to all of the sites, which (through some magical hand-waving) yields a single score for each variant that describes how likely it is based on all the examined dimensions. In our map analogy this is the equivalent of determining on which contour line the variant sits. Finally, we pick a threshold value indirectly by asking the question “what score do I need to choose so that e.g. 99% of the variants in my callset that are also in hapmap will be selected?”. This is called the target sensitivity. We can twist that dial in either direction depending on what is more important for our project, sensitivity or specificity.
Recalibrate variant types separately!
Due to important differences in how the annotation distributions relate to variant quality between SNPs and indels, we recalibrate them separately. See the Best Practices workflow recommendations for details on how to wire this up.
2. Variant recalibration procedure details
VariantRecalibrator builds the model(s)
The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.
From a more technical point of view, the idea is that we can develop a continuous, covarying estimate of the relationship between variant call annotations (QD, MQ, FS etc.) and the probability that a variant call is a true genetic variant versus a sequencing or data processing artifact. We determine this model adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). We can then apply this adaptive error model to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The VQSLOD score, which gets added to the INFO field of each variant is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
This step produces a recalibration file in VCF format and some accessory files (tranches and plots).
_Note that for workflow efficiency purposes it is possible to split this step in two: (1) run the tool on all the data and output an intermediate recalibration model report, then (2) run the tool again to calculate the VQSLOD scores and write out the recalibration file, tranches and plots. This has the advantage of making it possible to scatter the second part over genomic intervals, to accelerate the process. _
ApplyRecalibration applies a filtering threshold
Here we go again with the not-so-great naming, sorry. This tool doesn't really apply the recalibration model to the callset, since VariantRecalibrator already did that -- that's what the recalibration file contains. Rather, it applies a filtering threshold and writes out who passed and who fails to an output VCF.
During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in a truth set were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), what happens is that the program looks at what is the VQSLOD value above which 99.9% of the variants in the truth set are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the truth set, which are considered false positives. Yes, we accept the possibility that some small number of variant calls in the truth set are wrong...
3. Interpretation of the Gaussian mixture model plots
The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run. For every pairwise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.
The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from an example HiSeq call set of SNPs. This page shows the 2D projection of Mapping Quality Rank Sum Test (MQRankSum) versus Haplotype Score (HS) by marginalizing over the other annotation dimensions in the model.
Note that this is an old example that uses an annotation, Haplotype Score, that has been deprecated and is no longer available. However, all the points made in the description are still valid.
In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.
The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).
An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown. As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense based on what the annotated statistics mean; for example the higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out.
4. Tranches and the tranche plot
The VQSLOD score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority than one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.
The first tranche (90 by default, but you can specify your own values) has the lowest value of truth sensitivity but the highest value of novel Ti/Tv; it is extremely specific (almost exclusively real variants in here) but less sensitive (it's missing a lot). From there, each subsequent tranche introduces additional true positive calls along with a growing number of false positive calls.
A plot of the tranches is automatically generated for SNPs by the VariantRecalibrator tool if you have the requisite R dependencies installed; an example is shown below.
This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion (TiTv) ratio and the overall truth sensitivity. Remember that specificity decreases as the truth sensitivity increases.
The tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.
5. Resource datasets
This procedure relies heavily on the availability of good resource datasets of the following types:
This resource must be a call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites.
This resource must be a call set that has been validated to some degree of confidence. The program will consider that the variants in this resource may contain false positives as well as true variants (truth=false), and will use them to train the recalibration model (training=true).
Known sites resource
This resource can be a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true).
Obtaining appropriate resources
The human genome training, truth and known resource datasets that are used in our Best Practices workflow applied to human data are all available from our Resource Bundle.
If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties as described above. You'll also need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.
VariantRecalibrator is like a teenager, frequently moody and uncommunicative. There are many things that can put this tool in a bad mood and cause it to fail. When that happens, it rarely provides a useful error message. We're very sorry about that and we're working hard on developing a new tool (based on deep learning) that will be more stable and user-friendly. In the meantime though, here are a few things to watch out for.
Greedy for data
This tool expects large numbers of variant sites in order to achieve decent modeling with the Gaussian mixture model. It's difficult to put a hard number on mimimum requirements because it depends a lot on the quality of the data (clean, well-behaved data requires fewer sites because the clustering tends to be less noisy), but empirically we find that in humans, the procedure tends to work well enough with at least one whole genome or 30 exomes. Anything smaller than that scale is likely to run into difficulties, especially for the indel recalibration.
If you don't have enough of your own samples, consider using publicly available data (e.g. exome bams from the 1000 Genomes Project) to "pad" your cohort. Be aware however that you cannot simply merge in someone else's variant calls. You must joint-call variants from the original BAMs with your own samples. We recommend using the GVCF workflow to generate GVCFs from the original BAMs, and joint-call those GVCFs along with your own samples' GVCFs using GenotypeGVCFs.
One other thing that can help is to turn down the number of Gaussians used during training. This can be accomplished by adding
--maxGaussians 4 to your command line. This controls the maximum number of different "clusters" (=Gaussians) of variants the program is allowed to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.
Annotations are tricky
VariantRecalibrator assumes that the distribution of annotation values is gaussianly distributed, but we know this assumption breaks down for some annotations. For example, mapping quality has a very different distribution because it is not a calibrated statistic, so in some cases it can destabilize the model. When you run into trouble, excluding MQ from the list of annotations can be helpful.
In addition, some of the annotations included in our standard VQSR recommendations might not be the best for your particular dataset. In particular, the following caveats apply:
Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured. In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.
*InbreedingCoeff** is a population level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or that includes many closely related samples (such as a family) please omit this annotation from the command line.
Plot generation depends on having Rscript accessible in your environment path. Rscript is the command line version of R that allows execution of R scripts from the command line. We also make use of the ggplot2 library so please be sure to install that package as well. See the Common Problems section of the Guide for more details.