I am unable to use VQSR (recalibration) to filter variants

Geraldine_VdAuwera · Posts: 8,740 · Administrator, GATK Dev
edited October 30 in Common Problems

The problem:

Our preferred method for filtering variants after the calling step is to use VQSR, a.k.a. recalibration. However, it requires well-curated training/truth resources, which are typically not available for organisms other than humans, and it also requires a large number of variant sites to operate properly, so it is not suitable for some small-scale experiments such as targeted gene panels or exome studies with fewer than 30 exomes. For the latter, it is sometimes possible to pad your cohort with exomes from another study (especially for humans -- use 1000 Genomes or ExAC!), but for non-human organisms it is often not possible to do this.

The solution: hard-filtering

So, if this is your case and you are sure that you cannot use VQSR, you will need to use the VariantFiltration tool to filter your variants manually. To do this, you will need to compose filter expressions as explained here, based on the recommendations detailed further below.

But first, some caveats

Let's be painfully clear about this: there is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values is highly dependent on the specific callset: how it was called, what the data was like, what organism it came from, and so on.

HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.

In case you didn't catch that bit above, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis. You absolutely SHOULD expect to evaluate your results critically and TRY AGAIN with adjusted parameters until you find the settings that are right for your data.

In addition, please note that these recommendations are mainly designed for dealing with very small datasets (in terms of both the number of samples and the size of the targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, you should expect to have to do even more tweaking of the filtering parameters.

Filtering recommendations

Here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you. Note that these JEXL expressions will tag as filtered any sites where the annotation value matches the expression. So if you use the expression QD < 2.0, any site with a QD lower than 2 will be tagged as failing that filter.

For SNPs:

  • QD < 2.0
  • MQ < 40.0
  • FS > 60.0
  • SOR > 4.0
  • HaplotypeScore > 13.0 (only for variants output by UnifiedGenotyper; for HaplotypeCaller output this annotation is not informative)
  • MQRankSum < -12.5
  • ReadPosRankSum < -8.0

For indels:

  • QD < 2.0
  • ReadPosRankSum < -20.0
  • InbreedingCoeff < -0.8
  • FS > 200.0
  • SOR > 10.0
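
Put together, the recommendations above translate into VariantFiltration command lines along these lines (a sketch using GATK 3 syntax, matching the commands later in this thread; the reference and VCF file names are placeholders, and the thresholds should be tuned to your own data):

```shell
# Sketch only: file names are placeholders. Sites matching the expression
# are tagged with the filter name in the FILTER column, not removed.
java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R reference.fasta \
    -V raw_snps.vcf \
    --filterExpression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 4.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    --filterName "snp_hard_filter" \
    -o filtered_snps.vcf

# Same idea for indels; drop InbreedingCoeff if you have fewer than 10 samples.
java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R reference.fasta \
    -V raw_indels.vcf \
    --filterExpression "QD < 2.0 || ReadPosRankSum < -20.0 || InbreedingCoeff < -0.8 || FS > 200.0 || SOR > 10.0" \
    --filterName "indel_hard_filter" \
    -o filtered_indels.vcf
```

If you prefer to see which criterion each site failed, you can instead pass one --filterExpression/--filterName pair per annotation.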

And now some more IMPORTANT caveats (don't skip this!)

  • The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.

  • For shallow-coverage data (less than 10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.

  • The maximum DP (depth) filter only applies to whole-genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling, but in practice it is closer to a Gaussian distribution. Regardless, the DP threshold should be set at 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
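
To make the sigma arithmetic concrete, here is one way to sketch that calculation (the depths are invented; in practice you would extract per-site DP values from your own callset with your VCF toolkit of choice):

```shell
# Invented per-site depths: mean is 42x here, with one inflated outlier.
# The ceiling is mean + 6 * standard deviation of the DP distribution.
printf '%s\n' 30 28 32 31 29 150 30 31 29 30 | awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    sd = sqrt(sumsq / n - mean * mean)
    printf "DP > %.0f\n", mean + 6 * sd
  }'
```

Any site whose DP exceeds the printed threshold would then be tagged by a DP > X filter expression.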

Finally, a note of hope

Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.

HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.

And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?

Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)

We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.


Geraldine Van der Auwera, PhD


  • Geraldine_VdAuwera · Posts: 8,740 · Administrator, GATK Dev

    Questions and comments up to August 2014 have been moved to an archival thread here:


    Geraldine Van der Auwera, PhD

  • Gustav · Sweden · Posts: 12 · Member

    I am trying to soft filter my indel calls with the variant recalibrator, but have gotten very confused.
    For my SNPs everything works fine. And even the model plots for the indels look pretty good. But there is no "PASS" or "FILTERED" or anything else written in the filter column of the VCF file. It does not matter which tranche level I pick. Simply nothing happens? Nothing gets written in the filter column, no matter how I have tweaked the settings.
    My data consists of 35 whole exomes, and there are around 25,000 indel calls.
    I am using the latest version of GATK.
    My two command lines are:

    -T VariantRecalibrator -R /references/hs.build37.1.fa -input samples_poulsen.indel.vcf \
    -resource:mills,known=true,training=true,truth=true,prior=12.0 /RESOURCES/Mills_and_1000G_gold_standard.indels.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /dbsnp138.vcf.gz -mode INDEL \
    -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -rscriptFile recalibrate_INDEL_plots.R \
    -an QD -an FS -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff --maxGaussians 4


    -T ApplyRecalibration -R /references/hs.build37.1.fa \
    -input samples_poulsen.indel.vcf --ts_filter_level 99.0 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -o recal_samples_poulsen.indel.vcf

    I would be very thankful for any kind of help or explanation. Thank you.

  • Geraldine_VdAuwera · Posts: 8,740 · Administrator, GATK Dev

    Hi @Gustav,

    You need to specify -mode INDEL for ApplyRecalibration too, otherwise it only recalibrates SNPs.
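
    For reference, the corrected second command would look something like this (same placeholder paths as in Gustav's post, with the java invocation assumed; the only change is the added -mode INDEL):

    ```shell
    java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R /references/hs.build37.1.fa \
        -input samples_poulsen.indel.vcf \
        -mode INDEL \
        --ts_filter_level 99.0 \
        -recalFile recalibrate_INDEL.recal \
        -tranchesFile recalibrate_INDEL.tranches \
        -o recal_samples_poulsen.indel.vcf
    ```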

    Geraldine Van der Auwera, PhD

  • Gustav · Sweden · Posts: 12 · Member

    Ah ok. Thank you!

  • vsvinti · Posts: 66 · Member

    Above, you say:
    "for exomes, a straight DP filter shouldn't be used"
    but the hard filters say
    "QD < 2.0"
    Since they are related, do you mean 'Do not use QD < 2.0' filter for exomes ?

  • pdexheimer · Posts: 511 · Member, GATK Dev, DSDE Dev

    No. QD != DP

    Depth is a normalizing factor used in the QD calculation, but the two metrics are not related
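
    A toy calculation makes the distinction concrete: QD is (roughly) QUAL normalized by the depth of the samples carrying the variant, so a deep site can still have a very low QD. The numbers below are invented:

    ```shell
    # Invented numbers: a deep site (DP = 100) with a weak variant signal.
    awk 'BEGIN { qual = 150; dp = 100; printf "QD = %.1f\n", qual / dp }'
    ```

    Here DP = 100 would sail past any depth cutoff, yet QD = 1.5 fails the QD < 2.0 hard filter above.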

  • vsvinti · Posts: 66 · Member

    Ok. Just curious as I notice a 'peak' in variants with QD < 2 that pass the VQSR annotation. So I've been doing VQSR + QD > 2 for retaining variants... this is UG on gatk 2.7 though.

  • spekkio · Canada · Posts: 4 · Member
    edited August 6

    I have a question about padding a cohort with public exomes from 1000 Genomes phase one data.

    We have 8 exome samples and want to bring the total up to 30 by adding 22 exomes from 1000 Genomes phase 1, in order to run VQSR to find variants. I've seen it mentioned that this is something the Broad Institute does when dealing with small sample sizes, and I would like to know if you do any kind of pre-processing on the public data before you slide it into your pipeline? I've read ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/README.phase1_alignment_data on how the 1000 Genomes phase 1 .bam files were prepared, and from what I can tell they are already good to go, i.e. ready to run HaplotypeCaller on in -ERC GVCF mode. Am I correct in making this assumption, or should there be some steps to further prepare the .bam files before running HaplotypeCaller on them?

    quick edit: if there is some more processing to be done, is it different per sequencing platform, i.e. some .bams were from SOLiD and some were from Illumina?


  • Geraldine_VdAuwera · Posts: 8,740 · Administrator, GATK Dev

    Hi @spekkio,

    That is indeed our recommendation for dealing with small datasets. Ideally you'd want to apply exactly the same pre-processing to all samples (yours and the 1000G samples), including using the same versions of the tools, to avoid any batch effects. The pre-processing applied to the 1000G samples is essentially the one recommended in our Best Practices, but done with a much older version of the tools. So if you can spare the compute, it is better to revert and reprocess them all in the same way. When you use the padding samples for multiple projects it's easy to justify as a one-time investment. Unfortunately we aren't able to offer recently reprocessed versions of the 1000G samples at this time, although something like that is on the roadmap for our future cloud-based offering, which we're working on with Google.

    Geraldine Van der Auwera, PhD

  • rfraser · Guelph, Ontario · Posts: 2 · Member

    Hi GATK Team,

    Just a quick note - in the "The solution: hard-filtering" paragraph above, the "VariantFiltration" link, along with the first two "here" links, seem to be broken.


  • Geraldine_VdAuwera · Posts: 8,740 · Administrator, GATK Dev

    Thanks for pointing this out, @rfraser. Fixed the two VariantFiltration links; that second "here" was a deprecated doc that should no longer be linked to so I've removed it.

    Geraldine Van der Auwera, PhD
