Weird, disordered tranche plots generated by VariantRecalibrator

cr517 · Cambridge · Member


I have run GATK4 VariantRecalibrator on a VCF file from C. elegans data:

GATK VariantRecalibrator -R c_elegans.PRJNA13758.WS263.genomic.fa -V GGVCF.vcf --resource cendr,known=false,training=true,truth=true,prior=15.0:all.vcf.gz -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP --output output.recal --tranches-file output.tranche --rscript-file output.plots.R

GGVCF.vcf was output by GATK GenotypeGVCFs in a previous step. all.vcf.gz is a set of short variants that appear in natural isolates of C. elegans.

The attached plots are not ordered by % truth. Also, the bar with 90% truth should be the one with solid boxes, with no cumulative TPs or FPs. The plots also show that I've got more FPs than TPs. However, my data is deeply sequenced (> 100X), and 95% of variants have DP > 10 and QUAL > 30. Can I trust these truth results? GGVCF.vcf contains 22,000 variants in total; all.vcf.gz contains 2,427,507 variants.


  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭


    VQSR is probably not doing the best job here, as it needs many resource files. You can try the new CNN workflow, which is better suited to small datasets and non-model organisms. Read more about it here.
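One quick sanity check before trusting the tranche plot is to count how many sites in the call set are also present in the training/truth resource, since VQSR can only learn from the overlap. The following is a minimal sketch, not part of GATK: the helper names `open_vcf`, `site_keys` and `training_overlap` are hypothetical, and it assumes both files use the same contig names and allele representation (it matches on exact CHROM, POS, REF and ALT, with no normalization).

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def site_keys(lines):
    """Yield one (chrom, pos, ref, alt) key per ALT allele of each VCF record."""
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Multi-allelic records list ALT alleles separated by commas.
        for alt in alts.split(","):
            yield (chrom, pos, ref, alt)


def training_overlap(callset_lines, resource_lines):
    """Return (alleles in call set, alleles also found in the resource)."""
    resource = set(site_keys(resource_lines))
    calls = set(site_keys(callset_lines))
    return len(calls), len(calls & resource)


if __name__ == "__main__":
    # Tiny made-up records, for illustration only (not real C. elegans data).
    callset = [
        "##fileformat=VCFv4.2\n",
        "I\t100\t.\tA\tG\t50\t.\t.\n",
        "I\t200\t.\tC\tT,G\t60\t.\t.\n",
        "II\t50\t.\tG\tA\t40\t.\t.\n",
    ]
    resource = [
        "I\t100\t.\tA\tG\t.\t.\t.\n",
        "I\t200\t.\tC\tT\t.\t.\t.\n",
        "X\t5\t.\tA\tC\t.\t.\t.\n",
    ]
    total, shared = training_overlap(callset, resource)
    print(f"{shared} of {total} call-set alleles are in the resource")
```

With real files, the same function could be called as `training_overlap(open_vcf("GGVCF.vcf"), open_vcf("all.vcf.gz"))`. If only a small fraction of the 22,000 calls overlaps the resource, the model has very few training sites to work with, which would be consistent with an unreliable tranche plot. A dedicated tool such as `bcftools isec` is likely a better choice for a proper comparison, since this sketch does no allele normalization.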

