We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

How to choose the right threshold for VariantRecalibrator

I am using VQSR on a non-model species (88 whole genomes, following strictly the GATK best practices). For that I use a database containing ~40000 SNPs, obtained by genotyping by sequencing on 150 different samples. And I do not know which prior I should use for it. For now my commands are:

java -Xmx120g -jar $GATK -T VariantRecalibrator
-R /data2/CPBWGS/ref_genome/Ldec.genome.10062013.fa
-input /data2/CPBWGS/all_fastq_files/decemlineata/CPBWGS_leptinotarsa_SNPs.vcf
-recalFile Leptinotarsa_VariantRecalibrator_prior10.recal
-tranchesFile Leptinotarsa_VariantRecalibrator_prior10.tranches
-resource:dbsnp,known=true,training=true,truth=true,prior=10.0 /data2/CPBWGS/all_fastq_files/decemlineata/Analyses/VQSR/TrainingSetSNPs.vcf
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff -mode SNP

Because I couldn't figure out which prior I should use, I tried 5, 10 and 20; here are the tranches plots. Note that I can't manage to obtain the other graphs (don't what is wrong with our R configuration on our server; I take any suggestion about it too).

Do you have any hint for me?

Issue · Github
by Sheila

Issue Number
Last Updated
Closed By


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @benjaminpelissie,

    The prior is the rate of error in the resource, in phred scale. So your choice of prior should represent your confidence that any given variant call in the resource is reliable or not. In the human VQSR resources, we use scores in the 10 to 15 range for resources we are moderately to very confident in because their validation involved orthogonal technologies (like gene chips), 6 to 8 for those that we consider useful but not necessarily very good because they are only sequencing-based, and 2 for a resource that we would not really trust but use as an indicator of absolute known vs novel status (dbSNP).

    If your resource was generated from sequencing only, I would recommend using a prior in the lower range (6-8).

    That being said you definitely want to generate & look at the other graphs; they will tell you more about whether the model that is built from your resource makes sense.

  • benjaminpelissiebenjaminpelissie Madison, WIMember

    Thank you very much @Geraldine_VdAuwera. I just realized that I didn't upload my tranches plots (for some reason). Here they are. Can you just tell me if they make sense to you?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The "prior 5" is the one that makes the most sense. The others are a bit mangled, which we often see when the progression of novel TiTv by tranche does not match our expectations. This isn't always meaningful, but it could indicate that when using the higher prior values, the model has less predictive value.

    Do you have any expectations about what the TiTv of real variants should be in your organism? It's very commonly used in humans as a QC metric and AFAIK it holds up well enough in other mammals, but I don't know how appropriate it is in non-mammalian organisms. If my google-fu serves, you are studying beetles -- so are there any guidelines on expected TiTv in insects at all?

  • benjaminpelissiebenjaminpelissie Madison, WIMember

    Thanks Geraldine! And sorry for the late answer. I will finally follow your advice and go for prior=5. As for TiTv, nothing is known for my species (your google-fu served you well) and anyway the ranges of modeled TiTv are so narrow (~1.56-1.57) that I wouldn't rely on it in my case. Cheers!

Sign In or Register to comment.