GATK licensing moves to direct-through-Broad model -- read about it on the GATK blog

Variant Quality Score Recalibration (VQSR)

rpoplinrpoplin Posts: 122GATK Dev mod

This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article.

As a complement to this document, we encourage you to watch the workshop videos available on our Events webpage.

Slides that explain the VQSR methodology in more detail as well as the individual component variant annotations can be found here in the GSA Public Drop Box.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.


Introduction

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.

The variant recalibrator contrastively evaluates variants in a two step process, each performed by a distinct tool:

  • VariantRecalibrator
    Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants. This step produces a recalibration file.

  • ApplyRecalibration
    Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the specified lod threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.

How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.

How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), what happens is that the program looks at what is the VQSLOD value above which 99.9% of the variants in the training callset are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.


Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (in the above command line the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

image

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!


Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority than one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (from the bottom, with the highest value of truth sensitivity but usually the lowest values of novel Ti/Tv) is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select in a principled way more specific or more sensitive call sets or incorporate directly the recalibrated quality scores to avoid entirely the need to analyze only a fixed subset of calls but rather weight individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

image

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.


Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
  • The truth sensitivity (TS) approach is conceptual cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other sets of high-quality (~99% truly variable in the population) sets of sites should work just as well. In our experience, with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.


Finally, a couple of Frequently Asked Questions

- Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

- Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside. We also make use of the ggplot2 library so please be sure to install that package as well. See the Common Problems section of the Guide for more details.

Post edited by Geraldine_VdAuwera on

Comments

  • KurtKurt Posts: 207Member ✭✭✭

    @mhernaez ,

    I would suggest starting off with removing MQ from the model. that's usually been my achilles heel when dealing with targeted capture. I've had MQRankSum bite me a couple of times (for a different reason).

  • mhernaezmhernaez StanfordPosts: 5Member

    @Kurt Thanks! I will try that.

    mikel

  • SteveLSteveL BarcelonaPosts: 25Member

    I found this sentence VERY confusing, if not wrong (I am still not 100% sure):

    The first tranche (from the bottom, with lowest values) is exceedingly specific but less sensitive, and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls.

    Should it not be the first tranche from the TOP? i.e. in tranche-90 everything is considered TP (at least in this figure).

  • tommycarstensentommycarstensen United KingdomPosts: 284Member ✭✭✭

    @SteveL No, starting from "tranche-90" and going up your specificity will decrease and your sensitivity will increase. The sentence is correct.

  • SteveLSteveL BarcelonaPosts: 25Member

    Thanks @tommycarstensen - however the issue I have is that the image (immediately following paragraph with quoted sentence) on page one of this thread has the "tranche-90" at the top, so it isn't clear, to novices, what the word "bottom" refers to - also the "lowest" values, could be interpreted as tranche or the ti/tv, which unfortunately go in opposing directions. I just think it could be stated more clearly, but maybe it's just me.

    Issue · Github
    by Geraldine_VdAuwera

    Issue Number
    20
    State
    closed
    Last Updated
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    @SteveL I see what you mean, it's a fair point. We'll try to clarify.

    Geraldine Van der Auwera, PhD

  • shiroannshiroann Posts: 6Member

    Is the model better informed by using a VCF containing all the chromosomes or does the model still work efficiently when the recalibration is done per chromosome?

  • tommycarstensentommycarstensen United KingdomPosts: 284Member ✭✭✭

    @shiroann It is better to maximize the number of data points by running simultaneously across all chromosomes. That being said I think VQSR still might give reasonable results for individual chromosomes.

  • shiroannshiroann Posts: 6Member
    edited April 21

    Thanks @tommycarstensen . I'm assuming the per chromosome VQSR will be greatly influenced by the number of variants in that particular chromosome.I wonder if that introduces some underlying variation in the VQSLOD score sensitivity across the genome/exome?

    Post edited by shiroann on
  • tommycarstensentommycarstensen United KingdomPosts: 284Member ✭✭✭
    edited April 21

    @shiroann Yes, basically don't run VQSR per chromosome, unless you need the results quickly, before the Death Star reaches Alderaan or something like that. And you would have to run ApplyRecalibration separately per chromosome as well, if you were to do it. JUST don't DO IT (not a shoe company TM).

    ... Unless you are dealing with chromosomes, which you think might have annotations very different from your other chromosomes. I think I remember @Geraldine_VdAuwera telling me to run across all chromosomes irrespective of these differences. Here is the thread I was thinking of:

    http://gatkforums.broadinstitute.org/discussion/2895/vqsr-and-sex-chromosomes
    

    I posted a figure in that thread showing the the chromY variants more frequently fail to the DP annotation for obvious reasons:

    http://cd8ba0b44a15c10065fd-24461f391e20b7336331d5789078af53.r23.cf1.rackcdn.com/gatk.vanillaforums.com/FileUpload/99/99c914e1b209e5615f9de0c07ca55f.png
    
    Post edited by tommycarstensen on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    We really don't recommend running VQSR per chromosome, no. It's going to cause what are essentially batch effects between chromosomes of different sizes, among other issues.

    Geraldine Van der Auwera, PhD

  • WVNicholsonWVNicholson Warwick University, CoventryPosts: 14Member

    I posted a question about VQSR and issues with the VariantRecalibrator and VariantAnnotator about two weeks ago; but it still hasn't appeared anywhere so I'm concerned that it has been blocked outright. I'm not re-posting it here right now; because it's possible it was blocked earlier due to my inadvertently making multiple posts,

    William

  • SheilaSheila Broad InstitutePosts: 1,241Member, GATK Dev, Broadie, Moderator, DSDE Member admin

    @WVNicholson
    Hi William.

    I just verified your account. I don't see your question anywhere, so it is best to re-post again. I should be able to see it now.

    -Sheila

  • KathKath Posts: 43Member

    Hi,

    This might have been asked before (sorry...) but if I have less than 30 exome samples and matching 1000 genomes samples is going to be difficult (because I am running my samples through my pipeline and samples will differ each time according to enrichment method, intervals, ethnicity etc.) is it better to leave out the VQSR step completely or carry on using it on <30 samples?

    Thanks

    Kath

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    @Kath I would recommend running some tests and evaluating your results. Generally speaking we think it's worth the additional effort to put together a VQSR-worthy dataset.

    Geraldine Van der Auwera, PhD

  • WVNicholsonWVNicholson Warwick University, CoventryPosts: 14Member

    As mentioned earlier, I attempted to post the message below before; so it could actually still be in the system somewhere. (In particular, I sent it to "Ask the GATK Team" which may be distinct from the forums.)

    I'm trying to carry out VQSR on a dataset. When I run the VariantRecalibrator it fails to find annotations, even when added either by running the VariantAnnotator or by re-running GenotypeGVCFs with the arguments for the annotations to be generated. (The variants were originally called using the HaplotypeCaller.) My most recent command line for the VariantRecalibrator is:

    java -jar /home/u1374090/software/gatk/gatk_only/GenomeAnalysisTK.jar \
    -T VariantRecalibrator -R ../ncbi/ncbi_sorghum_combo_genome.fasta \
    -input S_all_modern_combo_jointan.vcf \
    -resource:ncbi_variations,known=false,training=true,truth=true,prior=10.0 ../ncbi/variations/nstd63_Zheng_et_al_2011.Sorbi1.submitted.variant_call.germline.wvned.vcf \
    -an DP -an FS -an MQRankSum \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -nt ${ntPara}

    The shell variable ntPara was set to "10". I've tried several variations of the above command line, all without success. What happens is that the final .recal and .tranches file are empty and I get the following output including an error message (which I've edited because it is quite long):

    INFO 17:36:29,136 HelpFormatter - Program Args: -T VariantRecalibrator -R ../ncbi/ncbi_sorghum_combo_genome.fasta -input S_all_modern_combo_jointan.vcf -resource:ncbi_variations,known=false,training=true,truth=true,prior=10.0 ../ncbi/variations/nstd63_Zheng_et_al_2011.Sorbi1.submitted.variant_call.germline.wvned.vcf -an DP -an FS -an MQRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -nt 10
    INFO 17:36:29,141 HelpFormatter - Executing as u1374090@hamilton on Linux 3.13.0-46-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17.
    .
    .
    .
    INFO 17:36:29,715 GenomeAnalysisEngine - Strictness is SILENT
    INFO 17:36:29,832 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 17:36:29,951 MicroScheduler - Running the GATK in parallel mode with 10 total threads, 1 CPU thread(s) for each of 10 data thread(s), of 64 processors available on this machine
    INFO 17:36:30,027 GenomeAnalysisEngine - Preparing for traversal
    INFO 17:36:30,030 GenomeAnalysisEngine - Done preparing for traversal
    INFO 17:36:30,031 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 17:36:30,031 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 17:36:30,031 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 17:36:30,035 TrainingSet - Found ncbi_variations track: Known = false Training = true Truth = true Prior = Q10.0
    INFO 17:36:48,155 VariantDataManager - DP: mean = NaN standard deviation = NaN
    INFO 17:36:51,231 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.3-0-g37228af):

    .
    .
    .

    ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator
  • SheilaSheila Broad InstitutePosts: 1,241Member, GATK Dev, Broadie, Moderator, DSDE Member admin

    @WVNicholson
    Hi,

    It looks like your training file may not have the DP annotation in it. Can you check and make sure it has it?

    -Sheila

  • WVNicholsonWVNicholson Warwick University, CoventryPosts: 14Member

    Indeed it looks like my training (VCF) file does not have the DP annotation or apparently any annotations that would be useful for VQSR. I'm still looking; but at this stage it's possibly none of the publicly available sets of known SNPs, at least as available in VCF format actually have the required annotations. It appears I may be able to get data in SRA format (apparently convertible to BAM) for some of the SNPs in the VCF file from Ensembl Plants. So presumably I can make my own annotated VCF from that one and the SRA files (after converting to BAM) by using VariantAnnotator. This looks like it's going to be an involved process; so I thought I should check if that's a sensible course of action?

    William

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    @WVNicholson There seems to have been a miscommunication in our team about what is required for VQSR. It is not necessary for the training resource file to contain any annotations; the program only uses the location of the sites from that file.

    Geraldine Van der Auwera, PhD

  • WVNicholsonWVNicholson Warwick University, CoventryPosts: 14Member

    Okay, I'm not sure where to go from here then. Inspection of the input VCF file (from my data rather than the other VCF file used for training) by eye appears to suggest it does actually have the DP annotation. The training file does not have the DP annotation; but you just said the training file is not required to contain any annotations. The only thing I can think of is that somehow I managed to call a lot of variants in the new data and none of them with agree with the known variants in the training set. That seems a bit unlikely; but would it give an error message like the one above?

    William

  • WVNicholsonWVNicholson Warwick University, CoventryPosts: 14Member

    On further investigation, my training VCF file may not be suitable. It looks like it has no SNPs - just indels and something described as "DUP" in the ALT field (CNVs, I think),

    William

  • WVNicholsonWVNicholson Warwick University, CoventryPosts: 14Member

    It turns out I have another VCF file with known SNPs and MNPs that should be suitable. However, when I attempt to use that I get a different error message:

    ERROR MESSAGE: Input files /home/u1374090/sorghum/gatkwork/../ensembl/variations/sorghum_bicolor.snps_mnps_only_wvned2.vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.

    I'm confused by the error message; because the contigs in the various files refer to chromosomes and should be treated as non-overlapping. I can however re-order the data for the contigs in the training VCF to match those in my VCF with my variant calls; but the error message may be a clue that there are other problems,

    William

    Issue · Github
    by Geraldine_VdAuwera

    Issue Number
    987
    State
    open
    Last Updated
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    @WVNicholson Sorry for the confusion. What the error message means by "overlapping contigs" is "contigs that are present in both files". It doesn't mean to imply that the sequences overlap. We'll try to improve the phrasing to clarify this.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.