The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Tranches plot with tranches less than 90

Otago, New ZealandPosts: 4Member
edited March 2014

Hi, I am planning to filter with ApplyRecal very stringently (I don't mind losing a lot of SNPs to reduce false positives), but I am having trouble viewing the tranches plot produced by VariantRecal to choose my cutoff level. When I run the following command I get the default visualisation of 90, 99, 99.9 & 100 with cumulative TPs & FPs, and everything looks fine:

java -Xmx8g -jar $gatk \ -T VariantRecalibrator \ -R$ref \
-input $vcf \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0$hapmap \
-resource:omni,known=false,training=true,truth=false,prior=12.0 $onek \ -resource:dbsnp,known=true,training=false,truth=false,prior=8.0$dbsnp \
-an QD \
-an MQRankSum \
-an MQ \
-an FS \
-an DP \
-mode SNP \
-recalFile ${haplocall_date}_all_sites.snp.recal \ -tranchesFile${haplocall_date}_all_sites.snp.tranches \
--rscript_file ${haplocall_date}_all_sites.snp.rscript \ 1>snp_recal_all_sites.out 2>snp_recal_all_sites.err & But when I add in the following flags after '-mode SNP' to change the tranches, the plot displays them out of order (and the Ti/Tv ratio doesn't seem to correlate as I expected): -tranche 80.0 -tranche 85.0 -tranche 90.0 -tranche 99.0 \ This means the cumulative FP/TP display is no longer appropriate and is making it hard to tell what the actual proportions are at each level. When I specify fewer tranches, it displays them in the same descending order no matter how many and in what order I specify them: The total number of variants that get through the filter also seems to change depending on what tranches are specified (eg the 85 cutoff has ~50K in the second graph and ~30 in the third graph); I would have thought it would be the same for the same dataset, but I might be misunderstanding something. 1-Why does the number of variants change when the same tranch is called in a different set of tranches? (And the Ti/Tv ratio seems to decrease regardless of whether the tranch is above or below 90)? 2- Why does the default tranch plot show ascending order of the tranches, but when I specify my own cutoffs it never does? 3- Alternatively, is there a way to remove the display of the cumulative TP/FPs to make the plot easier to read? 4- Or perhaps a simpler solution, can I bypass the plot altogether? Does one of the outputs from VariantRecal have the data (or a way of calculating it) about TP and FP predicted variants that is used to produce these plots so I can just compare the numbers? Thanks very much, and let me know if I need to clarify anything Post edited by scc on Tagged: #### Issue · Github March 9 by Geraldine_VdAuwera Issue Number 860 State closed Last Updated Assignee Array Milestone Array Closed By vdauwera ## Answers • Posts: 7,528Administrator, GATK Developer admin Hi there, 1-2) If I recall correctly, the tranches are plotted in order following the predicted Ti/Tv, not by the tranche percentage number. When you use the default tranches, Ti/Tv is generally expected to decrease as the tranche percentage increases, which is why the ordering is so neat. I'm not sure why the other (much lower) tranches have Ti/Tv ratios that are not in line with that trend. Meaning I don't know whether that corresponds to a biological reality, shows an issue with the callset, or is a bug. I will ask @rpoplin, the author of the VQSR, to weigh in on this. I do think there may be a bug in how the numbers are summed due to the unexpected ordering of tranches, which would explain the discrepancy between total numbers between plots. 3-4) You can certainly hack the plotting script to modify it if you'd like. The source is available on our github repository (see FAQs). You can use the tables produced by VariantRecalibrator to plot the data whichever way you want it. And you can use VariantEval to do a more detailed evaluation of your callset quality. We certainly recommend it. Geraldine Van der Auwera, PhD • Otago, New ZealandPosts: 4Member Thanks very much Geraldine, I'll have a go at VariantEval and getting into the plotting script! Cheers for all the great work you guys do • CaliforniaPosts: 8Member I think I am running into a similar problem myself, where lower tranches have lower Ti/Tv ratios. Perhaps given that for me this issue is occurring within the range of default tranches this might be an issue with the callset (?) but wanted to provide my figures in case it helps or if GATK team had other ideas. Attached also is the log file: Command Line: java -Xmx20g -jar GenomeAnalysisTK.jar -nt 8 -T VariantRecalibrator \ -R$SCRATCH/ucsc.hg19.fasta \
-input $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_Genotyped_1000bp_capture_Gvcf.vcf \ -recalFile$SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_1000bp_4.recal \
-tranchesFile $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_1000bp_4.tranches -resource:hapmap,known=false,training=true,truth=true,prior=15.0$SCRATCH/ihg-client.ucsf.edu/seielstadm/hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/1000G_omni2.5.hg19.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0$SCRATCH/ihg-client.ucsf.edu/seielstadm/1000G_phase1.snps.high_confidence.hg19.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/dbsnp_137.hg19.vcf \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an InbreedingCoeff -mode SNP \ -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 -tranche 85.0 \ -rscriptFile recalibrate_SNP_plots2_1000bp_4.R Default tranches: Additional tranches: • Posts: 7,528Administrator, GATK Developer admin This does sound like it's caused by the dataset deviating from the expected Ti/Tv profile. In such a case I think you do need to look at the Ti/Tv distribution in your data more closely, with other tools since our script is not built to handle such cases. In future we'll see if we can make the script more flexible, but for now I'm afraid we can't devote resources to this. Sorry I don't have anything more useful for you... Good luck! Geraldine Van der Auwera, PhD • IndiaPosts: 36Member Hello! I am working on exome data and I have used the following command for variant recalibration for the same: java -Xmx4g -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19_ref.fa -input 376_Unified_snp.vcf -recalFile 376_snp_1.output.recal -tranchesFile 376_snp_1.output.tranches --maxGaussians 3 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.01000G_phase1.snps.high_confidence.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.hg19.vcf -an QD -an HaplotypeScore -an FS -an MQ -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 -tranche 85.0 -rscriptFile 376_recal_snp_1.plots.R And I am also attaching the tranches plot I get. I wanted to know that looking at the tranches plot which tranche would you suggest me to consider for futher analysis? According to me as it looks 97.5 tranche should be considered as its has highest amout of TPs considering the rest ones, however 99 tranche has cumulative TPs + tranche specfic TPs. So which tranche should be considered? My actual confusion is between Tranche specific TPs and Cumulative TPs here! Thank you! • Posts: 7,528Administrator, GATK Developer admin Hi there, Your plots look a little weird because the tranches are out of order; I believe this might be due to the fact that you're not specifying an exome-adapted titv target (which I guess we need to add to our docs explicitly as it's a common mistake). Setting the -titv argument to something like 3 for exome should make your plots look better (if my hunch is correct). This doesn't affect the actual recalibration, but affects how the titv estimation is done in the plots. Once your plots are straightened out I expect the 99 tranche will look like the best middle of the road tradeoff point, though if you are looking to favor either specificity or sensitivity you may want to go for a lower or higher tranche, respectively. The thing to understand about tranches is that they're basically slices of dataset. The higher you set your tranche cutoff, the more slices you accumulate on your plate. Tranche-specific TPs are true-positive calls that you gain when you add a slice to your plate. Cumulative TPs are true-positive calls that are contained in all the slices you have added already. So this differentiation allows you to evaluate, if you go to the next tranche up, how many more TPs do you gain vs. the additional FPs that you have to take on as part of the deal. Does that make sense? Geraldine Van der Auwera, PhD • IndiaPosts: 36Member Yes !! Thanks that makes a lot of sense!! • IndiaPosts: 36Member So by out of order you mean, the sensitivity order in the graph? • Posts: 7,528Administrator, GATK Developer admin @bioinfo_89 said: So by out of order you mean, the sensitivity order in the graph? Yes; the program orders the data according to per-tranche titv (high to low), and normally that leads to a nice ordering by sensitivity (low to high). But in cases such as yours the expected relationship does not hold, and so we see these seemingly disordered graphs. A while ago I thought this might be due to omitting the expected titv, leading to calculation anomalies, but now I'm not so sure. Geraldine Van der Auwera, PhD • Posts: 19Member I'm also seeing strangeness in our tranche plots that looks like it might have the same cause. It looks like this is caused by the R script ordering the data by TiTv ratio instead of the truth sensitivity tranches. Normally, the 2 follow the same ordering, but if you use a lot of tranches (or have wacky data), the TiTv ratio isn't always ordered the same. I believe that this can be fixed in the plotTranches.R script by changing: data2 = data2[order(data2$novelTiTv, decreasing=F),]


to:

data2 = data2[order(data2\$targetTruthSensitivity, decreasing=T),]


I've attached 2 files ("current.pdf" and "expected.pdf") that highlight the different outputs made by this change. Perhaps others who are more knowledgeable than I can weigh in on if this is the correct "expected" behavior of the tranche plot in this case.

Thanks,

John Wallace

@johnwallace123 Technically both are correct, but I think your plots do a great job of highlighting how much more intuitively readable the plots are when they are ordered by truth sensitivity instead of TiTv. I'll put in a ticket to have the script reviewed and hopefully changed accordingly. Thanks for contributing this.

Geraldine Van der Auwera, PhD

• IndiaPosts: 36Member

Hi!!

I have a doubt regarding the following output i have got when i used the follwoing Variant recalibration command:

java -Xmx4g -jar GenomeAnalysisTK-3.3-0.tar.bz2_FILES/GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19_ref.fa -input raw_snp.vcf -recalFile snp_output.recal -tranchesFilesnp_output.tranches --maxGaussians 4 -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19_revised.vcf -an QD -an HaplotypeScore -an FS -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP -rscriptFile recal_snp.plots.R -titv 3 -tranche 100 -tranche 99.9 -tranche 99 -tranche 90

And I got a tranches plot which is attached below.

My question is regarding tranche 90 where the block shows Cumulative TPs and FPs together and TPs and cumulative FPs together. What exactly does this mean? The TPs region is also shaded with orange lines.

Please guide me what is wrong with my graph.

Hi @bioinfo_89 ,

This is a common problem where the plots look strange because your tranches do not follow the expected titv ratio expectations, as discussed above. We have a cosmetic fix going in soon (thanks to @johnwallace!) so the sensitivity graph will look more sane. But that will not fix the underlying problem of your data, which has very low titv values.

Geraldine Van der Auwera, PhD

• IndiaPosts: 36Member

So what do you suggest about my data, if I want to use this for further analysis?

• IndiaPosts: 36Member

How can this be improved?

• IndiaPosts: 36Member

Please have a look at the plot which I have got for my data (exome) using Recalibration step below:

I am not quite sure about the tranche 90 in the graph. Also which tranche would you suggest to be considered?

Thank you!

• IndiaPosts: 36Member

The Titv scores are very low!!

• IndiaPosts: 36Member

I am also attaching the tranches output for your reference.

# Version number 5

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,24841,534,2.8894,0.7919,3.1376,VQSRTrancheSNP0.00to90.00,SNP,19189,17270,0.9000
99.00,28610,656,2.8046,0.7218,0.3253,VQSRTrancheSNP90.00to99.00,SNP,19189,18997,0.9900
99.50,29383,731,2.7843,0.6961,-0.4800,VQSRTrancheSNP99.00to99.50,SNP,19189,19093,0.9950
99.90,33344,1460,2.6352,0.9185,-5.2153,VQSRTrancheSNP99.50to99.90,SNP,19189,19169,0.9990
100.00,35255,2126,2.5657,0.8455,-1287.5865,VQSRTrancheSNP99.90to100.00,SNP,19189,19189,1.0000