Weird, disordered tranche plots generated by VariantRecalibrator

cr517 (Cambridge, Member)

Hi,

I have run GATK4 VariantRecalibrator on a VCF file from C. elegans data:

GATK VariantRecalibrator -R c_elegans.PRJNA13758.WS263.genomic.fa -V GGVCF.vcf --resource cendr,known=false,training=true,truth=true,prior=15.0:all.vcf.gz -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP --output output.recal --tranches-file output.tranche --rscript-file output.plots.R

GGVCF.vcf was output by GATK GenotypeGVCFs in a previous step. all.vcf.gz is a set of short variants that appear in natural isolates of C. elegans.

The attached plots are not ordered by % truth. Also, the bar for 90% truth should be the one with solid boxes, with no cumulative TPs or FPs. The plots also show that I have more FPs than TPs, even though my data is deeply sequenced (> 100X) and 95% of variants have DP > 10 and QUAL > 30. Can I trust these truth results? The total number of variants in GGVCF.vcf is 22,000; all.vcf.gz contains 2,427,507 variants.
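For anyone wanting to reproduce the depth and quality figures quoted above, here is a minimal sketch in plain Python. It is not part of GATK. The function names (`info_depth`, `summarize`) are hypothetical, and it assumes the depth of interest is the site-level `DP` key in the INFO column, which sums depth across all samples, rather than the per-sample FORMAT `DP`.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def info_depth(info_field):
    """Return the INFO DP value as an int, or None if it is absent or malformed."""
    for entry in info_field.split(";"):
        if entry.startswith("DP="):
            try:
                return int(entry[3:])
            except ValueError:
                return None
    return None


def summarize(lines, min_dp=10, min_qual=30.0):
    """Count records, and records with INFO DP > min_dp and QUAL > min_qual.

    Assumption: thresholds are strict (">"), matching the wording in the post.
    Records with a missing QUAL (".") or no INFO DP count as not passing.
    """
    total = 0
    passing = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        total += 1
        qual = fields[5]
        depth = info_depth(fields[7])
        if qual != "." and depth is not None:
            if float(qual) > min_qual and depth > min_dp:
                passing += 1
    return total, passing
```

It could be used as `with open_vcf("GGVCF.vcf") as handle: total, passing = summarize(handle)`, after which `passing / total` gives the fraction of records that meet both thresholds.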

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @cr517
    Hi,

    VQSR is probably not doing the best job here, as it needs many resource files. You can try the new CNN workflow, which is better suited to small datasets and non-model organisms. Read more about it here.

    -Sheila
