Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Outputting and Using VariantRecalibration Models

Hello,

I'm working on optimizing the variant filtering pipeline for my team. Currently we're using VQSR following the best practices guidelines. I've been testing VQSR's ability to discern FPs and TPs by applying VQSR to sequencing data we've generated from the GM12878 cell line and comparing those VCFs to GIABs gold standard na12878 call-set using VCFeval. While following best practices for VQSR lowers the total number of FPs, it also lowers the F measure score, meaning more TPs are being filtered than FPs. This isn't optimal, naturally.

If I apply VariantRecalibrator with the GIAB snp call-set as a resource more FPs are filtered than TPs down to a tranche specificity of ~99.9. Ideally, I'd like to train a VQSR model using VCFs with gold standard call-sets as resources, output that model, and then apply that model to other VCFs.

I've been working to test the possibility of using this approach to variant filtering. The first step is to feed VariantRecalibrator a GM12878 library and the GIAB truth set, output the model, apply the recalibration and get results from VCFeval.

/home/gatk-4.0.3.0/gatk VariantRecalibrator \
   --reference $ref_fa \
   --variant $raw_snp \
   --resource GIAB,known=false,training=true,truth=true,prior=15.0:$GIAB_snp \
   --use-annotation DP \
   --use-annotation QD \
   --use-annotation FS \
   --use-annotation SOR \
   --use-annotation MQ \
   --use-annotation MQRankSum \
   --use-annotation ReadPosRankSum \
   --mode SNP \
   --truth-sensitivity-tranche 100.0 \
   --truth-sensitivity-tranche 99.98 \
   --truth-sensitivity-tranche 99.95 \
   --truth-sensitivity-tranche 99.90 \
   --output recalibrate_snp_giab.recal \
   --tranches-file recalibrate_snp_giab.tranches \
   --rscript-file recalibrate_snp_giab.plots.R \
   --output-model recalibrate_snp_giab.model

/home/gatk-4.0.3.0/gatk ApplyVQSR \
   --reference $ref_fa \
   --variant $raw_snp \
   --output ${laneid}.filtered.giab.99.9.snp.vcf.gz \
   --truth-sensitivity-filter-level 99.9 \
   --tranches-file recalibrate_snp_giab.tranches \
   --recal-file recalibrate_snp_giab.recal \
   --mode SNP

Next I want to input the model and the same VCF to VariantRecalibrator - ideally without resources, though that isn't possible - apply the recalibration and get results from VCFeval. If what I'm looking to do is possible, the two results should be the same for any given tranche.

An example of VariantRecalibrator options I've tried are below.

/home/gatk-4.0.3.0/gatk VariantRecalibrator \
   --reference $ref_fa \
   --variant $raw_snp \
   --resource HapMap,known=false,training=true,truth=true,prior=15.0:$HapMap \
   --resource Omni,known=false,training=true,truth=true,prior=12.0:$Omni \
   --resource 1000G,known=false,training=true,truth=false,prior=10.0:$Thousand_g  \
   --resource dbsnp,known=true,training=false,truth=false,prior=2.0:$DBsnp \
   --input-model $snp_model_orig \
   --use-annotation DP \
   --use-annotation QD \
   --use-annotation FS \
   --use-annotation SOR \
   --use-annotation MQ \
   --use-annotation MQRankSum \
   --use-annotation ReadPosRankSum \
   --mode SNP \
   --truth-sensitivity-tranche 100.0 \
   --truth-sensitivity-tranche 99.98 \
   --truth-sensitivity-tranche 99.95 \
   --truth-sensitivity-tranche 99.90 \
   --output recalibrate_snp_${laneid}.recal \
   --tranches-file recalibrate_snp_${laneid}.tranches \
   --rscript-file recalibrate_snp_${laneid}.plots.R \

/home/gatk-4.0.3.0/gatk ApplyVQSR \
   --reference $ref_fa \
   --variant $raw_snp \
   --output ${laneid}.filtered.model.99.9.snp.vcf.gz \
   --truth-sensitivity-filter-level 99.9 \
   --tranches-file recalibrate_snp_${laneid}.tranches \
   --recal-file recalibrate_snp_${laneid}.recal \
   --mode SNP

As I must supply some resources to VariantRecalibrator, I also tried minimizing the effect of any resources.

/home/gatk-4.0.3.0/gatk VariantRecalibrator \
   --reference $ref_fa \
   --variant $raw_snp \
   --resource HapMap,known=false,training=true,truth=true,prior=0.0:$HapMap \
   --input-model $snp_model_ms \
   --use-annotation DP \
   --use-annotation QD \
   --use-annotation FS \
   --use-annotation SOR \
   --use-annotation MQ \
   --use-annotation MQRankSum \
   --use-annotation ReadPosRankSum \
   --mode SNP \
   --truth-sensitivity-tranche 100.0 \
   --truth-sensitivity-tranche 99.98 \
   --truth-sensitivity-tranche 99.95 \
   --truth-sensitivity-tranche 99.90 \
   --output recalibrate_snp_${laneid}.recal \
   --tranches-file recalibrate_snp_${laneid}.tranches \
   --rscript-file recalibrate_snp_${laneid}.plots.R \

Neither of these approaches have been very successful in producing results similar to the model when it was first generated and applied. Is there anyway to use an output model "as is" without having it changed by VariantRecalibrator the second time around? Or do I misunderstand the nature of VQSR and how models are trained and applied?

Thanks a ton for any help! I really appreciate all the work y'all do!

-Ellis

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @CG_Anderson
    Hi Ellis,

    If I apply VariantRecalibrator with the GIAB snp call-set as a resource more FPs are filtered than TPs down to a tranche specificity of ~99.9.

    Can you post some tranche plots of different sensitivity levels? It would be nice if you show 90, 95 and 99.9.

    Ideally, I'd like to train a VQSR model using VCFs with gold standard call-sets as resources, output that model, and then apply that model to other VCFs.

    This is not ideal. VQSR uses the annotation values of the variants in your input callset to make the models, not the variant sites in your resource files. The models will differ from callset to callset, so it is not a good idea to have one model for every callset.

    Perhaps the presentation on VQSR will help here.

    -Sheila

  • CG_AndersonCG_Anderson Member
    edited June 2018

    Here's the image of the tranche plots using GIAB true positive sites to train the model.

    I understand that this is an adaptation of how VQSR is intended to be used. The idea is that training the model with known data can find relationships between annotation values that might be generalizable and could then be applied to other data sets in a way that might be more effective than applying VQSR using the best practices recommendations. I don't know if this will work necessarily, but I'm interested in testing it out. Is it possible to input a model and have VariantRecalibrator use strictly that model?

    Thank you for the help! I really appreciate it.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @CG_Anderson
    Hi,

    That tranche plot looks off. Sorry to hassle you, but can you check if you get the same output from 4.0.5.1? I don't know if this is a bug or "quirky output" from 4.0.3.0

    Is the plot from using all resource files we recommend? Also, can you produce some tranche plots using different sensitivity levels and post them (setting --truth-sensitivity-filter-level to different values like 90, 95 and 99.9)?

    Thanks,
    Sheila

Sign In or Register to comment.