Weird, disordered tranche plots generated by VariantRecalibrator

cr517 (Cambridge, Member)


I have run GATK4 VariantRecalibrator on a VCF file from C. elegans data:

GATK VariantRecalibrator -R c_elegans.PRJNA13758.WS263.genomic.fa -V GGVCF.vcf --resource cendr,known=false,training=true,truth=true,prior=15.0:all.vcf.gz -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP --output output.recal --tranches-file output.tranche --rscript-file output.plots.R

GGVCF.vcf was output by GATK GenotypeGVCFs in a previous step. all.vcf.gz is a set of short variants that appear in natural isolates of C. elegans.

The attached plots are not ordered by % truth. Also, the bar with 90% truth should be the one with solid boxes, without cumulative TPs or FPs. The plots also show that I have more FPs than TPs. However, my data is deeply sequenced (> 100X), and 95% of variants have DP > 10 and QUAL > 30. Can I trust these truth results? The total number of variants in GGVCF.vcf is 22,000; all.vcf.gz contains 2,427,507 variants.
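For anyone wanting to reproduce the DP/QUAL figure quoted above, here is a minimal sketch in plain Python. It is not part of GATK: the function names `parse_info` and `fraction_passing` are made up for illustration, and it assumes an uncompressed, tab-separated VCF where depth is read from the site-level `DP` key in the INFO column (total depth across samples, not per-sample depth).

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DP=120;QD=25.1' into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value  # flag-style keys without '=' map to ''
    return fields


def fraction_passing(lines, min_dp=10, min_qual=30.0):
    """Fraction of variant records with INFO DP > min_dp and QUAL > min_qual.

    `lines` is any iterable of VCF text lines (an open file works).
    Records with a missing or non-numeric QUAL or DP count as not passing.
    """
    total = 0
    passing = 0
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        qual = cols[5]               # column 6 of a VCF record is QUAL
        info = parse_info(cols[7])   # column 8 of a VCF record is INFO
        total += 1
        try:
            ok = float(qual) > min_qual and int(info.get("DP", "0")) > min_dp
        except ValueError:
            ok = False
        if ok:
            passing += 1
    return passing / total if total else 0.0
```

Passing an open handle to GGVCF.vcf to `fraction_passing` would return the proportion of records meeting both thresholds. Note that site-level depth and QUAL say nothing about how VQSR assigns records to tranches, so this only checks the stated figure.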


  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)


    VQSR is probably not doing the best job, as it needs many resource files. You can try using the new CNN workflow which is better for small datasets and non-model organisms. Read more about it here.

