GenotypeGVCFs call confidence and emit thresholds

jsreddy82 (Mayo Clinic, Jacksonville, FL), Member

We have ~10,000 WES samples and generated a gVCF for each using HaplotypeCaller (HC). To generate a multi-sample consensus VCF, we performed joint genotyping with GenotypeGVCFs, and subsequently ran VQSR to assign quality scores to the variants. Some variants that fail VQSR have good quality scores in a few individual samples; can we rescue these by lowering the emit or call confidence thresholds at the joint genotyping phase? If so, how low can we go with these cutoffs?

[Just FYI, we already tested whether we could rescue these variants by lowering the emit and call confidence thresholds to 0 (which we know is a bad idea, since it lets poor quality calls creep in). The number of variants more than doubled, and variants that were initially not even detected received a VQSR "PASS", making the results highly unreliable.]

Thanks
Joseph.

Answers

  • jsreddy82 (Mayo Clinic, Jacksonville, FL), Member

    @Geraldine_VdAuwera thank you for your response and suggestions. We will continue to experiment with these thresholds and will keep you posted here.

    A follow-up question: what type and size of datasets do you typically process when testing the Broad pipelines?

    Thank you.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie, admin

    We see projects of all sizes go through, but AFAIK joint calling test runs get done on ~50 samples.

  • jsreddy82 (Mayo Clinic, Jacksonville, FL), Member

    Thank you for the information @Geraldine_VdAuwera. Apart from modifying the emit and call confidence thresholds, do you have any recommendations or tweaks for rescuing variants at the sample sizes we are dealing with (either at the joint genotyping phase or during VQSR)? I will look into modifying these thresholds and generating ROC curves (I still have to figure out how to do this).
    Appreciate your help!

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie, admin

    For that sample size I would try lowering the thresholds to 10 for starters, and playing with the VQSR sensitivity tranches to find the tradeoff that's right for you.
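
    For example, something like this (a GATK 3.x sketch; file names are placeholders, and for a cohort this size -V would point at your combined GVCFs):

        # lower both confidence thresholds to 10 at the joint genotyping step
        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
            -R reference.fasta \
            -V cohort.g.vcf \
            -stand_call_conf 10 \
            -stand_emit_conf 10 \
            -o joint_genotyped.vcf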

    Also, there's a new step we're adding to the Best Practices that arose as a lesson learned from the ExAC analysis. We're currently writing up docs for it as part of an update to the Best Practices that's coming out soon (this and dropping indel realignment from pre-processing). In a nutshell, before doing VQSR, you hard-filter on InbreedingCoefficient (the VariantFiltration args are --filterExpression "InbreedingCoeff < -0.3" --filterName "InbreedingCoeff"), and you remove InbreedingCoefficient from the VQSR annotations.
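
    Wired up as a command, that hard filter might look like this (a GATK 3.x sketch; file names are placeholders, and note the JEXL expression has to be quoted on the shell):

        # flag sites with a strongly negative inbreeding coefficient before VQSR
        java -jar GenomeAnalysisTK.jar -T VariantFiltration \
            -R reference.fasta \
            -V joint_genotyped.vcf \
            --filterExpression "InbreedingCoeff < -0.3" \
            --filterName "InbreedingCoeff" \
            -o inbreeding_filtered.vcf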

    For making ROC curves, we're just now adding a step to our workshops where we show how to do that with RTG Tools, a third-party tool that we're adopting for teaching purposes because it's pretty user-friendly. There's a link and install instructions here if you'd like to have a look. Keep an eye on our blog as we'll post the workshop materials there next week; they include a detailed walkthrough of how to plot ROC curves from VQSLOD.
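
    The basic invocation looks something like this (a sketch; it assumes you've converted your reference to SDF format with rtg format, and the file names are placeholders):

        # compare calls to a truth set, ranking sites by VQSLOD
        rtg vcfeval -b truth.vcf.gz -c my_calls.vcf.gz \
            -t reference.sdf -o eval_dir \
            --vcf-score-field=INFO.VQSLOD
        # plot the resulting ROC curve
        rtg rocplot eval_dir/weighted_roc.tsv.gz --png roc.png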

  • jsreddy82 (Mayo Clinic, Jacksonville, FL), Member

    Thank you @Geraldine_VdAuwera. I will experiment with those args and use RTG Tools to generate ROC curves as you've suggested. Looking forward to the new workshop materials.

  • jsreddy82 (Mayo Clinic, Jacksonville, FL), Member

    Hello @Geraldine_VdAuwera, do you have any recommendations on which SNP call annotations should and should not be used when calibrating VQSLOD scores (with VariantRecalibrator), given a WES sample size of ~10K? We have been using the following (a sketch of our indel command is below):
    For SNVs: QD, MQRankSum, ReadPosRankSum and FS
    For Indels: FS, ReadPosRankSum and MQRankSum, with maxGaussians=4
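
    Our indel invocation looks roughly like this (GATK 3.x; resource lines abbreviated and file names are placeholders):

        java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
            -R reference.fasta -input joint_genotyped.vcf \
            -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.vcf \
            -an FS -an ReadPosRankSum -an MQRankSum \
            --maxGaussians 4 -mode INDEL \
            -recalFile indels.recal -tranchesFile indels.tranches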

    After ApplyRecalibration with --ts_filter_level at 99.0, this is what our tranches look like for SNVs and Indels:

    SNVs:
    INFO 05:40:44,361 ApplyRecalibration - Read tranche Tranche ts=90.00 minVQSLod=2.2741 known=(194887 @ 3.4440) novel=(81112 @ 2.5197) truthSites(71634 accessible, 64470 called), name=VQSRTrancheSNP0.00to90.00]
    INFO 05:40:44,361 ApplyRecalibration - Read tranche Tranche ts=99.00 minVQSLod=0.5478 known=(382584 @ 3.4622) novel=(467011 @ 2.5720) truthSites(71634 accessible, 70917 called), name=VQSRTrancheSNP90.00to99.00]
    INFO 05:40:44,362 ApplyRecalibration - Read tranche Tranche ts=99.90 minVQSLod=-3.1092 known=(540988 @ 3.4254) novel=(1184450 @ 2.4763) truthSites(71634 accessible, 71562 called), name=VQSRTrancheSNP99.00to99.90]
    INFO 05:40:44,362 ApplyRecalibration - Read tranche Tranche ts=100.00 minVQSLod=-2617.2125 known=(551831 @ 3.3729) novel=(1208339 @ 2.4456) truthSites(71634 accessible, 71634 called), name=VQSRTrancheSNP99.90to100.00]

    Indels:
    INFO 14:02:07,403 ApplyRecalibration - Read tranche Tranche ts=90.00 minVQSLod=1.9445 known=(4365 @ 0.0000) novel=(21949 @ 0.0000) truthSites(2825 accessible, 2542 called), name=VQSRTrancheINDEL0.00to90.00]
    INFO 14:02:07,403 ApplyRecalibration - Read tranche Tranche ts=99.00 minVQSLod=-0.3975 known=(5736 @ 0.0000) novel=(68334 @ 0.0000) truthSites(2825 accessible, 2796 called), name=VQSRTrancheINDEL90.00to99.00]
    INFO 14:02:07,404 ApplyRecalibration - Read tranche Tranche ts=99.90 minVQSLod=-7.0192 known=(6018 @ 0.0000) novel=(78544 @ 0.0000) truthSites(2825 accessible, 2822 called), name=VQSRTrancheINDEL99.00to99.90]
    INFO 14:02:07,404 ApplyRecalibration - Read tranche Tranche ts=100.00 minVQSLod=-184.8103 known=(6047 @ 0.0000) novel=(79108 @ 0.0000) truthSites(2825 accessible, 2825 called), name=VQSRTrancheINDEL99.90to100.00]
    INFO 14:02:07,535 ApplyRecalibration - Keeping all variants in tranche Tranche ts=99.00 minVQSLod=-0.3975 known=(5736 @ 0.0000) novel=(68334 @ 0.0000) truthSites(2825 accessible, 2796 called), name=VQSRTrancheINDEL90.00to99.00]

    Of the 1.84 million variants identified, only 0.92 million (50%) "PASS" VQSR.

    Apart from our earlier discussion about adjusting thresholds during joint genotyping to rescue certain variants that fail VQSR, do you think the annotations we use during VariantRecalibrator might be affecting our final callset of "filtered" variants?

    Thank you for your time,
    Joseph.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie, admin

    Hi Joseph,

    All the annotations that we recommend for VQSR are expected to affect filtering to some extent. The tutorial from the latest workshop shows how to plot annotations against VQSLOD to investigate how much effect they had and whether the program's decisions look sane relative to expectations. Have you had a chance to look them over?

  • jsreddy82 (Mayo Clinic, Jacksonville, FL), Member

    Yes Geraldine, I just finished going through the "Filtering & Evaluation" Appendix and Worksheet; the content was extremely helpful, thank you! I will start plotting ROC curves to get a better understanding of the annotations in our callset.

  • jsreddy82 (Mayo Clinic, Jacksonville, FL), Member

    @Geraldine_VdAuwera since I have a multi-sample VCF file with ~2 million variants from 10K WES samples, what should be used as the baseline VCF file for RTG vcfeval? In your workshop you used a subset of the GIAB NA12878 VCF as the baseline. vcfeval also requires the name of the sample you wish to evaluate if the input is a multi-sample VCF. I don't know how to scale this up to evaluate all samples and genotypes in my multi-sample VCF against the truth set (or whether this is even possible). Please advise.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie, admin

    @jsreddy82 The use case for RTG vcfeval as we presented it in the workshop is when you want to evaluate the pipeline and/or sequencing technology you're using, and you're running on a sample for which you have truth data, either by itself or in the context of a cohort (at this point we have sequenced and analyzed NA12878 so many times...).

    If you didn't include a known sample in your cohort (most people don't) then that particular method doesn't apply (at least not as far as we're concerned -- the RTG devs may have other use cases covered). You would then instead use a tool like VariantEval to compute general summary statistics and compare them to what you expect based on other relevant sources (eg population data, comparisons to 1000G, ExAC and so on).
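
    A minimal VariantEval sketch (GATK 3.x; file names are placeholders):

        # summary metrics (Ti/Tv ratio, known vs novel counts, etc.) for the filtered callset
        java -jar GenomeAnalysisTK.jar -T VariantEval \
            -R reference.fasta \
            --eval filtered_cohort.vcf \
            -D dbsnp_138.vcf \
            -o cohort.eval.grp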

  • jsreddy82 (Mayo Clinic, Jacksonville, FL), Member

    Thank you Geraldine. I think the best way to proceed is to include NA12878, redo joint genotyping on my 10K WES samples, and then compare against the gold standard callset. You provided the NA12878 g.vcf for chr20; will you be able to share, if available, the g.vcf for NA12878 for all chromosomes (hg19)?

    I have a few follow-up questions:

    In one of your documents (https://www.broadinstitute.org/gatk/guide/article?id=6308) you discuss the importance of selecting truth sets, one of the assumptions being "that your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set". Our GATK pipelines are set up to run VQSR with HapMap and Omni as true sites and 1000G as non-true sites for SNPs, and Mills for indels, as described here: http://gatkforums.broadinstitute.org/gatk/discussion/1259/which-training-sets-arguments-should-i-use-for-running-vqsr

    1. In our WES study with ~10K samples, I evaluated the percentage of passing variants as a function of sample size (50, 100, 500, 2000, 5000 and 10K), keeping the truth sensitivity level for filtering (--ts_filter_level) at 99. I found that the percentage of passing variants drops from about 85% at N=50 to almost 55% at N=10K. Is this something you expect to see with WES samples? Some of the variants that do not pass VQSR have good GQ scores. While I am in the process of evaluating how the various annotations affect VQSLOD scores, is it also safe to assume that the training sets being used, given the large sample size, are in some way affecting variant filtration?
    2. In a recent study [PMID: 26974950], the authors used a combination of custom training sets and empirically derived thresholds for site and genotype filters to balance sensitivity and specificity. We do not have similar ExomeChip data to build a custom training set, but it seems it might be helpful to build something similar in silico from our own data. What we have in mind is to first identify a set of high confidence variants within our dataset by increasing the emit and call confidence thresholds during joint genotyping (from 30 to 50). We would then use these high confidence SNPs and INDELs as training sets to re-evaluate the entire callset generated with the default emit (30) and call confidence (30) thresholds (see the sketch after this list). What are your thoughts on this?
    3. Our 10K WES samples were captured using two different kits. Our current thinking is that it will be better to perform joint genotyping only on variants in the common capture regions (intersection) rather than on the union of all sites in both capture kits. What are your thoughts on this?
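
    To make idea #2 concrete, the retraining step we have in mind would look something like this (purely hypothetical: the ourhighconf resource, its prior, and all file names are placeholders we made up):

        # hypothetical: high-confidence calls from the stricter (conf 50) run as an extra training resource
        java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
            -R reference.fasta -input default_conf30.vcf \
            -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.vcf \
            -resource:ourhighconf,known=false,training=true,truth=false,prior=10.0 highconf_conf50.vcf \
            -an QD -an MQRankSum -an ReadPosRankSum -an FS \
            -mode SNP -recalFile snps.recal -tranchesFile snps.tranches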

    We really appreciate your help and support,
    Thank you for your time,
    Joseph.

  • jsreddy82 (Mayo Clinic, Jacksonville, FL), Member

    Dear @Geraldine_VdAuwera and @Sheila

    Sorry for bombarding you with all these questions; I have one more to add to the list above.

    For WES data, it is recommended that DP not be used for calibrating VQSLOD. While QD is recommended, can/should it be used for both SNPs and INDELs? My confusion arises from my understanding that QD is QUAL normalized by the alternate allele depth (AD); since AD for INDELs is not as uniform as it is for SNPs, does using QD for INDELs affect VQSLOD scores in any way? Your thoughts?

    PS. Forgive me if any of my questions sound naive. This is my first time using GATK and I'm trying to understand and implement the GATK Best Practices as accurately as possible on our WES data.

    Thanks,
    Joseph.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie, admin
    Accepted Answer

    @jsreddy82 Right now we're not set up to share full-genome GVCFs, unfortunately. I foresee that happening in the not-so-distant future, but it's going to take some doing so don't count on it right now.

    Re: your VQSR questions:

    1. It's hard to say categorically, but I would say the biggest effect here is that the rate at which you accumulate false positives over many samples is much higher than that of true positives. Most true positives will be shared by most samples, so as you add samples, you don't add many new true variants. However, most false positives are private to the samples in which they're called, at least those due to largely random errors. So with every new sample, you add a new crop of false positives. Hence the percentage of true vs false looks a lot worse in a very large cohort.

    2. To be frank, I'm not sure I understand the proposed strategy in detail; but I will say that we find that QUAL thresholds perform poorly for distinguishing high confidence variants. QUAL scores are prone to inflation, which is why we only use them for filtering out what we consider to be the completely implausible tranche of results. I don't think I'd want to use them for anything else.

    3. When confronted with this problem we restrict analysis to the intersection.

    4. You're right that the QD evaluation is not as straightforward for indels as for SNPs, and that could reduce its effectiveness in the indel case. However iirc it still performs well enough to be worth using.

  • mr_dave (Maryland), Member

    @Geraldine_VdAuwera said:
    Also, there's a new step we're adding to the Best Practices that arose as a lesson learned from the ExAC analysis. We're currently writing up docs for it as part of an update to the Best Practices that's coming out soon (this and dropping indel realignment from pre-processing). In a nutshell, before doing VQSR, you hard-filter on InbreedingCoefficient (the VariantFiltration args are --filterExpression "InbreedingCoeff < -0.3" --filterName "InbreedingCoeff"), and you remove InbreedingCoefficient from the VQSR annotations.

    @Geraldine_VdAuwera does the Best Practices documentation reflect this update regarding InbreedingCoeff somewhere? I don't see it on broadinstitute.org or in the GATK presentations on Google Drive.

    Digging a little further, it looks like gatk-workflows/gatk4-germline-snps-indels on GitHub includes hard filtering on ExcessHet > 54.69, a related metric.

    The supplementary methods for the ExAC paper mention removing InbreedingCoeff < -0.2, although the context suggests this was a site filter applied after VQSR.

    I'm trying to find the cause of an excess of false positives after VQSR in roughly 1,500 WGS samples (using dbSNP_138.excluding_sites_after_129). I've included a plot of the InbreedingCoeff and ExcessHet values for autosomal variants after GenotypeGVCFs; the thresholds discussed are marked with dotted blue lines.

  • Sheila (Broad Institute), Member, Broadie, Moderator, admin

    @mr_dave
    Hi,

    That comment from Geraldine is from two years ago; we now recommend using ExcessHet over InbreedingCoefficient. You can try hard filtering with ExcessHet > 54.69 and see if that helps.
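
    In GATK4 syntax that looks something like this (a sketch; file names are placeholders):

        # hard-filter sites with excess heterozygosity before VQSR
        gatk VariantFiltration \
            -V genotyped.vcf.gz \
            --filter-expression "ExcessHet > 54.69" \
            --filter-name ExcessHet \
            -O excesshet_filtered.vcf.gz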

    -Sheila

  • mr_dave (Maryland), Member

    @Sheila Hi Sheila, thank you for the update, and sorry for pulling up an old thread. I'll look into using ExcessHet as a hard filter; a threshold of 54.69 would remove approximately 1.5% of our ~60M sites.

    The comments here and here seem to elaborate on the ExcessHet recommendation. Those threads and the gatk-workflows/gatk4-germline-snps-indels repo are the extent of what I've been able to find regarding any recommendation from GATK on hard filtering with ExcessHet.

    As seen in the plot, an underflow error seems to create a lot of very high ExcessHet values, presumably because of small p-values. I guess if I were to actually incorporate ExcessHet into VQSR instead of using it as a hard filter, I would need to hard-filter those extreme values first.
