We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Tranches plot with tranches less than 90

sccscc Otago, New ZealandMember
edited March 2014 in Ask the GATK team

Hi, I am planning to filter with ApplyRecal very stringently (I don't mind losing a lot of SNPs to reduce false positives), but I am having trouble viewing the tranches plot produced by VariantRecal to choose my cutoff level. When I run the following command I get the default visualisation of 90, 99, 99.9 & 100 with cumulative TPs & FPs, and everything looks fine:

java -Xmx8g -jar $gatk \
-T VariantRecalibrator \
-R $ref \
-input $vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
-resource:omni,known=false,training=true,truth=false,prior=12.0 $onek \
-resource:dbsnp,known=true,training=false,truth=false,prior=8.0 $dbsnp \
--minNumBadVariants 1000 \
-an QD \
-an MQRankSum \
-an ReadPosRankSum \
-an MQ \
-an FS \
-an DP \
-mode SNP \
-recalFile ${haplocall_date}_all_sites.snp.recal \
-tranchesFile ${haplocall_date}_all_sites.snp.tranches \
--rscript_file ${haplocall_date}_all_sites.snp.rscript \
1>snp_recal_all_sites.out 2>snp_recal_all_sites.err &

But when I add in the following flags after '-mode SNP' to change the tranches, the plot displays them out of order (and the Ti/Tv ratio doesn't seem to correlate as I expected):

-tranche 80.0 -tranche 85.0 -tranche 90.0 -tranche 99.0 \

This means the cumulative FP/TP display is no longer appropriate and is making it hard to tell what the actual proportions are at each level. When I specify fewer tranches, it displays them in the same descending order no matter how many and in what order I specify them:

The total number of variants that get through the filter also seems to change depending on what tranches are specified (eg the 85 cutoff has ~50K in the second graph and ~30 in the third graph); I would have thought it would be the same for the same dataset, but I might be misunderstanding something.

1-Why does the number of variants change when the same tranch is called in a different set of tranches? (And the Ti/Tv ratio seems to decrease regardless of whether the tranch is above or below 90)?

2- Why does the default tranch plot show ascending order of the tranches, but when I specify my own cutoffs it never does?

3- Alternatively, is there a way to remove the display of the cumulative TP/FPs to make the plot easier to read?

4- Or perhaps a simpler solution, can I bypass the plot altogether? Does one of the outputs from VariantRecal have the data (or a way of calculating it) about TP and FP predicted variants that is used to produce these plots so I can just compare the numbers?

Thanks very much, and let me know if I need to clarify anything

Issue · Github
by Geraldine_VdAuwera

Issue Number
Last Updated
Closed By


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    1-2) If I recall correctly, the tranches are plotted in order following the predicted Ti/Tv, not by the tranche percentage number. When you use the default tranches, Ti/Tv is generally expected to decrease as the tranche percentage increases, which is why the ordering is so neat. I'm not sure why the other (much lower) tranches have Ti/Tv ratios that are not in line with that trend. Meaning I don't know whether that corresponds to a biological reality, shows an issue with the callset, or is a bug. I will ask @rpoplin, the author of the VQSR, to weigh in on this. I do think there may be a bug in how the numbers are summed due to the unexpected ordering of tranches, which would explain the discrepancy between total numbers between plots.

    3-4) You can certainly hack the plotting script to modify it if you'd like. The source is available on our github repository (see FAQs). You can use the tables produced by VariantRecalibrator to plot the data whichever way you want it. And you can use VariantEval to do a more detailed evaluation of your callset quality. We certainly recommend it.

  • sccscc Otago, New ZealandMember

    Thanks very much Geraldine, I'll have a go at VariantEval and getting into the plotting script! Cheers for all the great work you guys do

  • crojocrojo CaliforniaMember

    I think I am running into a similar problem myself, where lower tranches have lower Ti/Tv ratios. Perhaps given that for me this issue is occurring within the range of default tranches this might be an issue with the callset (?) but wanted to provide my figures in case it helps or if GATK team had other ideas. Attached also is the log file:

    Command Line:

    java -Xmx20g -jar GenomeAnalysisTK.jar -nt 8 -T VariantRecalibrator \
    -R $SCRATCH/ucsc.hg19.fasta \
    -input $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_Genotyped_1000bp_capture_Gvcf.vcf \
    -recalFile $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_1000bp_4.recal \
    -tranchesFile $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_1000bp_4.tranches -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/1000G_omni2.5.hg19.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/1000G_phase1.snps.high_confidence.hg19.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/dbsnp_137.hg19.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an InbreedingCoeff -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 -tranche 85.0 \
    -rscriptFile recalibrate_SNP_plots2_1000bp_4.R

    Default tranches:

    Additional tranches:

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    This does sound like it's caused by the dataset deviating from the expected Ti/Tv profile. In such a case I think you do need to look at the Ti/Tv distribution in your data more closely, with other tools since our script is not built to handle such cases. In future we'll see if we can make the script more flexible, but for now I'm afraid we can't devote resources to this. Sorry I don't have anything more useful for you... Good luck!

  • bioinfo_89bioinfo_89 IndiaMember


    I am working on exome data and I have used the following command for variant recalibration for the same:
    java -Xmx4g -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19_ref.fa -input 376_Unified_snp.vcf -recalFile 376_snp_1.output.recal -tranchesFile 376_snp_1.output.tranches --maxGaussians 3
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.hg19.vcf -an QD -an HaplotypeScore -an FS
    -an MQ -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 -tranche 85.0 -rscriptFile 376_recal_snp_1.plots.R

    And I am also attaching the tranches plot I get.

    I wanted to know that looking at the tranches plot which tranche would you suggest me to consider for futher analysis? According to me as it looks 97.5 tranche should be considered as its has highest amout of TPs considering the rest ones, however 99 tranche has cumulative TPs + tranche specfic TPs. So which tranche should be considered?

    My actual confusion is between Tranche specific TPs and Cumulative TPs here!

    Thank you!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    Your plots look a little weird because the tranches are out of order; I believe this might be due to the fact that you're not specifying an exome-adapted titv target (which I guess we need to add to our docs explicitly as it's a common mistake). Setting the -titv argument to something like 3 for exome should make your plots look better (if my hunch is correct). This doesn't affect the actual recalibration, but affects how the titv estimation is done in the plots.

    Once your plots are straightened out I expect the 99 tranche will look like the best middle of the road tradeoff point, though if you are looking to favor either specificity or sensitivity you may want to go for a lower or higher tranche, respectively.

    The thing to understand about tranches is that they're basically slices of dataset. The higher you set your tranche cutoff, the more slices you accumulate on your plate. Tranche-specific TPs are true-positive calls that you gain when you add a slice to your plate. Cumulative TPs are true-positive calls that are contained in all the slices you have added already. So this differentiation allows you to evaluate, if you go to the next tranche up, how many more TPs do you gain vs. the additional FPs that you have to take on as part of the deal.

    Does that make sense?

  • bioinfo_89bioinfo_89 IndiaMember

    Yes !! Thanks that makes a lot of sense!!

  • bioinfo_89bioinfo_89 IndiaMember

    So by out of order you mean, the sensitivity order in the graph?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @bioinfo_89 said:
    So by out of order you mean, the sensitivity order in the graph?

    Yes; the program orders the data according to per-tranche titv (high to low), and normally that leads to a nice ordering by sensitivity (low to high). But in cases such as yours the expected relationship does not hold, and so we see these seemingly disordered graphs. A while ago I thought this might be due to omitting the expected titv, leading to calculation anomalies, but now I'm not so sure.

  • johnwallace123johnwallace123 Member ✭✭


    I'm also seeing strangeness in our tranche plots that looks like it might have the same cause. It looks like this is caused by the R script ordering the data by TiTv ratio instead of the truth sensitivity tranches. Normally, the 2 follow the same ordering, but if you use a lot of tranches (or have wacky data), the TiTv ratio isn't always ordered the same. I believe that this can be fixed in the plotTranches.R script by changing:

    data2 = data2[order(data2$novelTiTv, decreasing=F),]


    data2 = data2[order(data2$targetTruthSensitivity, decreasing=T),]

    I've attached 2 files ("current.pdf" and "expected.pdf") that highlight the different outputs made by this change. Perhaps others who are more knowledgeable than I can weigh in on if this is the correct "expected" behavior of the tranche plot in this case.


    John Wallace

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @johnwallace123 Technically both are correct, but I think your plots do a great job of highlighting how much more intuitively readable the plots are when they are ordered by truth sensitivity instead of TiTv. I'll put in a ticket to have the script reviewed and hopefully changed accordingly. Thanks for contributing this.

  • bioinfo_89bioinfo_89 IndiaMember


    I have a doubt regarding the following output i have got when i used the follwoing Variant recalibration command:

    java -Xmx4g -jar GenomeAnalysisTK-3.3-0.tar.bz2_FILES/GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19_ref.fa -input raw_snp.vcf -recalFile snp_output.recal -tranchesFilesnp_output.tranches --maxGaussians 4 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19_revised.vcf -an QD -an HaplotypeScore -an FS -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP -rscriptFile recal_snp.plots.R -titv 3 -tranche 100 -tranche 99.9 -tranche 99 -tranche 90

    And I got a tranches plot which is attached below.

    My question is regarding tranche 90 where the block shows Cumulative TPs and FPs together and TPs and cumulative FPs together. What exactly does this mean? The TPs region is also shaded with orange lines.

    Please guide me what is wrong with my graph.

    Thank you in advance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @bioinfo_89 ,

    This is a common problem where the plots look strange because your tranches do not follow the expected titv ratio expectations, as discussed above. We have a cosmetic fix going in soon (thanks to @johnwallace!) so the sensitivity graph will look more sane. But that will not fix the underlying problem of your data, which has very low titv values.

  • bioinfo_89bioinfo_89 IndiaMember

    So what do you suggest about my data, if I want to use this for further analysis?

  • bioinfo_89bioinfo_89 IndiaMember

    How can this be improved?

  • bioinfo_89bioinfo_89 IndiaMember

    Please have a look at the plot which I have got for my data (exome) using Recalibration step below:

    I am not quite sure about the tranche 90 in the graph. Also which tranche would you suggest to be considered?

    Thank you!

  • bioinfo_89bioinfo_89 IndiaMember

    The Titv scores are very low!!

  • bioinfo_89bioinfo_89 IndiaMember

    I am also attaching the tranches output for your reference.

    Variant quality score tranches file

    Version number 5


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @bioinfo_89 First, please don't post so many comments in a row. Think about what you want to say then try to say it all in one comment. We receive an email for every single comment each person posts, so if everyone did like you, we would get hundreds of emails a day.

    Second, there is clearly something wrong with your data; it is not usable for downstream analysis. You'll need to check that you followed our best practices, and maybe do some QC analysis on your original data. We may be able to provide some guidance; if you want that, please post a new discussion (not a comment in an existing discussion) and describe in detail how you ran the analysis: what version of GATK you used, what steps are in your workflow, and what command lines you used for each step.

  • bioinfo_89bioinfo_89 IndiaMember

    Ok! I'll definitely keep that in mind!! Thank you!

Sign In or Register to comment.