To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at

Best Practice pipeline for WES, including VQSR - what is best?

SteveLSteveL BarcelonaMember

I am currently processing ~100 exomes and following the Best Practice recommendations for Pre-processing and Variant Discovery. However, there are a couple of gaps in the documentation, as far as I can tell, regarding exactly how to proceed with VQSR with exome data. I would be grateful for some feedback, particularly regarding VQSR. The issues are similar to those discussed on this thread: but my questions aren't fully-addressed there (or elsewhere on the Forum as far as I can see).

Prior Steps:
1) All samples processed with same protocol (~60Mb capture kit) - coverage ~50X-100X
2) Alignment with BWA-MEM (to whole genome)
3) Remove duplicates, indel-realignment, bqsr
4) HC to produce gVCFs (-ERC)
5) Genotype gVCFs

This week I have been investigating VQSR, which has generated some questions.

Q1) Which regions should I use from my data for building the VQSR model?

Here I have tried 3 different input datasets:

a) All my variant positions (11Million positions)
b) Variant positions that are in the capture kit (~326k positions) - i.e. used bedtools intersect to only extract variants from (1)
c) Variant positions that are in the capture kit with padding of 100nt either side (~568k positions) - as above but bed has +/-100 on regions + uniq to remove duplicate variants that are now in more than one bed region

For each of the above, I have produced "sensitive" and "specific" datasets:
"Specific" --ts_filter_level 90.0 \ for both SNPs and INDELs
"Sensitive" --ts_filter_level 99.5 \ for SNPs, and --ts_filter_level 99.0 \ for INDELs (as suggested in the definitive FAQ )

I also wanted to see what effect, if any, the "-tranche" argument has - i.e. does it just allow for ease of filtering, or does it affect the mother generated, since it was not clear to me. I applied either 5 tranches or 6:

5-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 \ for both SNPs and INDELs
6-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 \ for both SNPs and INDELs

To compare the results I then used bed intersect to get back to the variants that are within the capture kit (~326k, as before). The output is shown in the spreadsheet image below.

What the table appears to show me, is that at the "sensitive" settings (orange background), the results are largely the same - the difference between "PASS" in the set at the bottom where all variants were being used, and the others is mostly accounted for by variants being pushed into the 99.9-100 tranche.

However, when trying to be specific (blue background), the difference between using all variants, or just the capture region/capture+100 is marked. Also surprising (at least for me) is the huge difference in "PASS" in cells E15 and E16, where the only difference was the number of tranches given to the model (note that there is very little difference in the analogous cells in Rows 5/6 andRows 10/11.

Q2) Can somebody explain why there is such a difference in "PASS" rows between All-SPEC and the Capture(s)-Spec
Q3) Can somebody explain why 6 tranches resulted in ~23k more PASSes than 5 tranches for the All-SPEC
Q4) What does "PASS" mean in this context - a score =100? Is it an observation of a variant position in my data that has been observed in the "truth" set? It isn't actually described in the header of the VCF, though presumably the following corresponds:
FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -38682.8777">
Q5) Similarly, why do no variants fall below my lower tranche threshold of 90? Is it because they are all reliable at least to this level?

Q6) Am I just really confused? :-(

Thanks in advance for your help! :-)


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @SteveL,

    This is a really complex set of questions, which I can't take the time to address fully right now, but I'll try to give you some key points that will hopefully help you dig deeper.

    Regarding intervals: the key point to understand is that the VQSR works by grouping variants into multidimensional clusters, with the aim of identifying what does a "good" variant look like based on the training and truth sets. Because we're relying on finding similarities between variants, it is generally simpler, for exome data, to limit the analysis to on-target regions, plus some minor padding . Otherwise we introduce a source of noise since we expect variants in off-target to exhibit properties that may diverge from those of on-target variants. At high sensitivity settings this won't have a marked effect, because we're accepting most variants anyway, but at higher specificity levels, where we're much more picky, is where the noisiness of the model can have much more impact on what variants are retained vs filtered.

    The use of tranches is mostly a convenience function that allows us to evaluate up front certain slices of the data. The definition of tranches has no effect on the calculations as such, but because their properties are analyzed cumulatively, the numbers associated with them may appear different when you run with different tranches.

    The score of a variant is derived from its position relative to the clusters of variants identified as "good" by the training model. The threshold for variants to be called PASSing is determined on the basis of the fraction of truth sites that are retained for a given tranche. Eg if you chose 99.9, it means you are accepting as threshold whatever score is necessary to retain the top-scoring 99.9% of variants in your truth set, if you were to apply filtering to that set on the basis of the training model that was produced.

    I hope that I have not added further to your confusion :)

  • SteveLSteveL BarcelonaMember

    Hi @Geraldine_VdAuwera, Thanks for the detail in your answer, as always. I have three immediately practical questions to follow-up.

    1) Should I subset the VQSR resources data to the same regions I am interested in before running VQSR? i.e. should I intersect the training resources (below) with e.g. my padded-exome-capture-region bed and then use this intersect to perform the VQSR - presumably avoiding the "noise" of the other putative variants spread throughout the whole genome we have aligned to. I did not do this in the examples above.

     -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
     -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
     -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
     -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \

    However, given that the best-practice recommendations are for high-sensitivity, rather than high-specificity, perhaps this is unnecessary (given the results I show in the table in my original post).

    2) My question regarding the meaning of PASS was because of confusion regarding two issues:

    Firstly, PASS is issued to a position, but that can not necessarily imply that all the calls at that position are reliable, as far as I can tell - I guess it suggests that in general the calls in that position look reliable - notwithstanding if we have a 0/1 call and an AD of 2, we might be dubious as to whether it was real or not? Similarly if the ALT has more than one variant (e.g. A,T), the 0/1 calls may be reliable but the 0/2 not? Therefore it probably is necessary to apply some form of hard-filtering?

    Secondly, if I choose --ts_filter_level 99.5, surely the PASS are all 99.5+, so why do I have values in a 99.9-100.0 tranche? I must be misunderstanding something here.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @SteveL,

    1) There's no need to subset the resources up front; this will be done internally as the program intersects the resources and your callset when determining what to use for the model.

    2) The score (and PASS status) is given at the variant level, based on the variant-level statistics, and represents the notion that at least one of the samples shows evidence of variation at this site. The reliability of individual genotype calls is represented separately by the GQ and PL values. We have a genotype refinement workflow that you can use to calibrate the reliability of individual genotypes if you care about them (this depends on the type of study you are doing).

    The percentage in the filter level does not correspond to individual score values; it corresponds to the percentage of truth sites that you recover if you apply the filters to the truth set based on the score calibration derived by the model.

  • SteveLSteveL BarcelonaMember

    Thanks @Geraldine_VdAuwera, So, now I think that my main issues is that I have been understanding the tranches back-to-front. It should be that tthe next best tranche, after PASS, is the lowest? i.e. A variant in a -99.5 tranche is more likely to be real, according to the model, than a variant in a 99.5-99.9 tranche? If this is correct, then that basically explains all of my original doubts, together with knowing that the VQSR program is smart enough to identify regions it can use for model generation.

    I'll take a good look at the genotype refinement workflow threads - thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Yep, that's right -- a lot of people get tripped up by that so don't feel bad :)

Sign In or Register to comment.