Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on October 14, 2019, due to the U.S. holiday. We will return to monitoring the forum on October 15.

How to choose the right threshold for VariantRecalibrator

benjaminpelissiebenjaminpelissie Madison, WIMember

Hi!
I am using VQSR on a non-model species (88 whole genomes, following strictly the GATK best practices). For that I use a database containing ~40000 SNPs, obtained by genotyping by sequencing on 150 different samples. And I do not know which prior I should use for it. For now my commands are:

java -Xmx120g -jar $GATK -T VariantRecalibrator
-R /data2/CPBWGS/ref_genome/Ldec.genome.10062013.fa
-input /data2/CPBWGS/all_fastq_files/decemlineata/CPBWGS_leptinotarsa_SNPs.vcf
-recalFile Leptinotarsa_VariantRecalibrator_prior10.recal
-tranchesFile Leptinotarsa_VariantRecalibrator_prior10.tranches
-resource:dbsnp,known=true,training=true,truth=true,prior=10.0 /data2/CPBWGS/all_fastq_files/decemlineata/Analyses/VQSR/TrainingSetSNPs.vcf
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff -mode SNP

Because I couldn't figure out which prior I should use, I tried 5, 10 and 20; here are the tranches plots. Note that I can't manage to obtain the other graphs (don't what is wrong with our R configuration on our server; I take any suggestion about it too).

Do you have any hint for me?
Ben

Issue · Github
by Sheila

Issue Number
1425
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    Hi @benjaminpelissie,

    The prior is the rate of error in the resource, in phred scale. So your choice of prior should represent your confidence that any given variant call in the resource is reliable or not. In the human VQSR resources, we use scores in the 10 to 15 range for resources we are moderately to very confident in because their validation involved orthogonal technologies (like gene chips), 6 to 8 for those that we consider useful but not necessarily very good because they are only sequencing-based, and 2 for a resource that we would not really trust but use as an indicator of absolute known vs novel status (dbSNP).

    If your resource was generated from sequencing only, I would recommend using a prior in the lower range (6-8).

    That being said you definitely want to generate & look at the other graphs; they will tell you more about whether the model that is built from your resource makes sense.

  • benjaminpelissiebenjaminpelissie Madison, WIMember

    Thank you very much @Geraldine_VdAuwera. I just realized that I didn't upload my tranches plots (for some reason). Here they are. Can you just tell me if they make sense to you?

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    The "prior 5" is the one that makes the most sense. The others are a bit mangled, which we often see when the progression of novel TiTv by tranche does not match our expectations. This isn't always meaningful, but it could indicate that when using the higher prior values, the model has less predictive value.

    Do you have any expectations about what the TiTv of real variants should be in your organism? It's very commonly used in humans as a QC metric and AFAIK it holds up well enough in other mammals, but I don't know how appropriate it is in non-mammalian organisms. If my google-fu serves, you are studying beetles -- so are there any guidelines on expected TiTv in insects at all?

  • benjaminpelissiebenjaminpelissie Madison, WIMember

    Thanks Geraldine! And sorry for the late answer. I will finally follow your advice and go for prior=5. As for TiTv, nothing is known for my species (your google-fu served you well) and anyway the ranges of modeled TiTv are so narrow (~1.56-1.57) that I wouldn't rely on it in my case. Cheers!
    Ben

Sign In or Register to comment.