US Holiday notice: this Thursday and Friday (Nov 25-26) the forum will be unattended. Normal service will resume Monday Nov 29. Happy Thanksgiving!

Tranches plot with tranches less than 90

sccscc Otago, New ZealandPosts: 4Member
edited March 24 in Ask the GATK team

Hi, I am planning to filter with ApplyRecal very stringently (I don't mind losing a lot of SNPs to reduce false positives), but I am having trouble viewing the tranches plot produced by VariantRecal to choose my cutoff level. When I run the following command I get the default visualisation of 90, 99, 99.9 & 100 with cumulative TPs & FPs, and everything looks fine:

java -Xmx8g -jar $gatk \ -T VariantRecalibrator \ -R $ref \ -input $vcf \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \ -resource:omni,known=false,training=true,truth=false,prior=12.0 $onek \ -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 $dbsnp \ --minNumBadVariants 1000 \ -an QD \ -an MQRankSum \ -an ReadPosRankSum \ -an MQ \ -an FS \ -an DP \ -mode SNP \ -recalFile ${haplocall_date}_all_sites.snp.recal \ -tranchesFile ${haplocall_date}_all_sites.snp.tranches \ --rscript_file ${haplocall_date}_all_sites.snp.rscript \ 1>snp_recal_all_sites.out 2>snp_recal_all_sites.err &

But when I add in the following flags after '-mode SNP' to change the tranches, the plot displays them out of order (and the Ti/Tv ratio doesn't seem to correlate as I expected):

-tranche 80.0 -tranche 85.0 -tranche 90.0 -tranche 99.0 \

This means the cumulative FP/TP display is no longer appropriate and is making it hard to tell what the actual proportions are at each level. When I specify fewer tranches, it displays them in the same descending order no matter how many and in what order I specify them:

The total number of variants that get through the filter also seems to change depending on what tranches are specified (eg the 85 cutoff has ~50K in the second graph and ~30 in the third graph); I would have thought it would be the same for the same dataset, but I might be misunderstanding something.

1-Why does the number of variants change when the same tranch is called in a different set of tranches? (And the Ti/Tv ratio seems to decrease regardless of whether the tranch is above or below 90)?

2- Why does the default tranch plot show ascending order of the tranches, but when I specify my own cutoffs it never does?

3- Alternatively, is there a way to remove the display of the cumulative TP/FPs to make the plot easier to read?

4- Or perhaps a simpler solution, can I bypass the plot altogether? Does one of the outputs from VariantRecal have the data (or a way of calculating it) about TP and FP predicted variants that is used to produce these plots so I can just compare the numbers?

Thanks very much, and let me know if I need to clarify anything

Post edited by scc on

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Hi there,

    1-2) If I recall correctly, the tranches are plotted in order following the predicted Ti/Tv, not by the tranche percentage number. When you use the default tranches, Ti/Tv is generally expected to decrease as the tranche percentage increases, which is why the ordering is so neat. I'm not sure why the other (much lower) tranches have Ti/Tv ratios that are not in line with that trend. Meaning I don't know whether that corresponds to a biological reality, shows an issue with the callset, or is a bug. I will ask @rpoplin, the author of the VQSR, to weigh in on this. I do think there may be a bug in how the numbers are summed due to the unexpected ordering of tranches, which would explain the discrepancy between total numbers between plots.

    3-4) You can certainly hack the plotting script to modify it if you'd like. The source is available on our github repository (see FAQs). You can use the tables produced by VariantRecalibrator to plot the data whichever way you want it. And you can use VariantEval to do a more detailed evaluation of your callset quality. We certainly recommend it.

    Geraldine Van der Auwera, PhD

  • sccscc Otago, New ZealandPosts: 4Member

    Thanks very much Geraldine, I'll have a go at VariantEval and getting into the plotting script! Cheers for all the great work you guys do

  • crojocrojo CaliforniaPosts: 8Member

    I think I am running into a similar problem myself, where lower tranches have lower Ti/Tv ratios. Perhaps given that for me this issue is occurring within the range of default tranches this might be an issue with the callset (?) but wanted to provide my figures in case it helps or if GATK team had other ideas. Attached also is the log file:

    Command Line:

    java -Xmx20g -jar GenomeAnalysisTK.jar -nt 8 -T VariantRecalibrator \ -R $SCRATCH/ucsc.hg19.fasta \ -input $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_Genotyped_1000bp_capture_Gvcf.vcf \ -recalFile $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_1000bp_4.recal \ -tranchesFile $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_1000bp_4.tranches -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/1000G_omni2.5.hg19.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/1000G_phase1.snps.high_confidence.hg19.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/dbsnp_137.hg19.vcf \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an InbreedingCoeff -mode SNP \ -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 -tranche 85.0 \ -rscriptFile recalibrate_SNP_plots2_1000bp_4.R

    Default tranches:

    Additional tranches:

    txt
    txt
    1000bp_capture_snpVariantRecal_time_4.txt
    18K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    This does sound like it's caused by the dataset deviating from the expected Ti/Tv profile. In such a case I think you do need to look at the Ti/Tv distribution in your data more closely, with other tools since our script is not built to handle such cases. In future we'll see if we can make the script more flexible, but for now I'm afraid we can't devote resources to this. Sorry I don't have anything more useful for you... Good luck!

    Geraldine Van der Auwera, PhD

  • bioinfo_89bioinfo_89 IndiaPosts: 26Member

    Hello!

    I am working on exome data and I have used the following command for variant recalibration for the same: java -Xmx4g -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19_ref.fa -input 376_Unified_snp.vcf -recalFile 376_snp_1.output.recal -tranchesFile 376_snp_1.output.tranches --maxGaussians 3 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf
    -resource:1000G,known=false,training=true,truth=false,prior=10.01000G_phase1.snps.high_confidence.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.hg19.vcf -an QD -an HaplotypeScore -an FS -an MQ -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 -tranche 85.0 -rscriptFile 376_recal_snp_1.plots.R

    And I am also attaching the tranches plot I get.

    I wanted to know that looking at the tranches plot which tranche would you suggest me to consider for futher analysis? According to me as it looks 97.5 tranche should be considered as its has highest amout of TPs considering the rest ones, however 99 tranche has cumulative TPs + tranche specfic TPs. So which tranche should be considered?

    My actual confusion is between Tranche specific TPs and Cumulative TPs here!

    Thank you!

    pdf
    pdf
    376_snp_2.output.tranches.pdf
    8K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,682Administrator, GATK Developer admin

    Hi there,

    Your plots look a little weird because the tranches are out of order; I believe this might be due to the fact that you're not specifying an exome-adapted titv target (which I guess we need to add to our docs explicitly as it's a common mistake). Setting the -titv argument to something like 3 for exome should make your plots look better (if my hunch is correct). This doesn't affect the actual recalibration, but affects how the titv estimation is done in the plots.

    Once your plots are straightened out I expect the 99 tranche will look like the best middle of the road tradeoff point, though if you are looking to favor either specificity or sensitivity you may want to go for a lower or higher tranche, respectively.

    The thing to understand about tranches is that they're basically slices of dataset. The higher you set your tranche cutoff, the more slices you accumulate on your plate. Tranche-specific TPs are true-positive calls that you gain when you add a slice to your plate. Cumulative TPs are true-positive calls that are contained in all the slices you have added already. So this differentiation allows you to evaluate, if you go to the next tranche up, how many more TPs do you gain vs. the additional FPs that you have to take on as part of the deal.

    Does that make sense?

    Geraldine Van der Auwera, PhD

  • bioinfo_89bioinfo_89 IndiaPosts: 26Member

    Yes !! Thanks that makes a lot of sense!!

Sign In or Register to comment.