The current GATK version is 3.2-2

Tranches plot with tranches less than 90

Otago, New Zealand · Posts: 4 · Member
edited March 24

Hi, I am planning to filter with ApplyRecalibration very stringently (I don't mind losing a lot of SNPs to reduce false positives), but I am having trouble viewing the tranches plot produced by VariantRecalibrator to choose my cutoff level. When I run the following command I get the default visualisation of the 90, 99, 99.9 & 100 tranches with cumulative TPs & FPs, and everything looks fine:

    java -Xmx8g -jar $gatk \
        -T VariantRecalibrator \
        -R $ref \
        -input $vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 $onek \
        -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 $dbsnp \
        --minNumBadVariants 1000 \
        -an QD -an MQRankSum -an ReadPosRankSum -an MQ -an FS -an DP \
        -mode SNP \
        -recalFile ${haplocall_date}_all_sites.snp.recal \
        -tranchesFile ${haplocall_date}_all_sites.snp.tranches \
        --rscript_file ${haplocall_date}_all_sites.snp.rscript \
        1>snp_recal_all_sites.out 2>snp_recal_all_sites.err &

But when I add the following flags after '-mode SNP' to change the tranches, the plot displays them out of order (and the Ti/Tv ratio doesn't correlate as I expected):

    -tranche 80.0 -tranche 85.0 -tranche 90.0 -tranche 99.0

This means the cumulative FP/TP display is no longer appropriate, which makes it hard to tell what the actual proportions are at each level. When I specify fewer tranches, it displays them in the same descending order no matter how many I specify, and in what order. The total number of variants that gets through the filter also seems to change depending on which tranches are specified (e.g. the 85 cutoff has ~50K in the second graph and ~30 in the third graph); I would have thought it would be the same for the same dataset, but I might be misunderstanding something.

1. Why does the number of variants change when the same tranche is included in a different set of tranches? (And why does the Ti/Tv ratio seem to decrease regardless of whether the tranche is above or below 90?)
2. Why does the default tranche plot show the tranches in ascending order, but when I specify my own cutoffs it never does?
3. Alternatively, is there a way to remove the cumulative TP/FP display to make the plot easier to read?
4. Or, perhaps a simpler solution: can I bypass the plot altogether?
Does one of the outputs from VariantRecalibrator have the data (or a way of calculating it) about the predicted TP and FP variants used to produce these plots, so I can just compare the numbers? Thanks very much, and let me know if I need to clarify anything.

Post edited by scc

• Posts: 6,418 · Administrator, GATK Developer, admin

Hi there,

1-2) If I recall correctly, the tranches are plotted in order of predicted Ti/Tv, not by the tranche percentage. With the default tranches, Ti/Tv is generally expected to decrease as the tranche percentage increases, which is why the ordering looks so neat. I'm not sure why the other (much lower) tranches have Ti/Tv ratios out of line with that trend; I don't know whether that corresponds to a biological reality, shows an issue with the callset, or is a bug. I will ask @rpoplin, the author of the VQSR, to weigh in on this. I do think there may be a bug in how the numbers are summed, due to the unexpected ordering of tranches, which would explain the discrepancy in total numbers between plots.

3-4) You can certainly hack the plotting script to modify it if you'd like. The source is available on our GitHub repository (see FAQs). You can also use the tables produced by VariantRecalibrator to plot the data whichever way you want, and you can use VariantEval to do a more detailed evaluation of your callset quality. We certainly recommend it.

Geraldine Van der Auwera, PhD

• Otago, New Zealand · Posts: 4 · Member

Thanks very much Geraldine, I'll have a go at VariantEval and getting into the plotting script! Cheers for all the great work you guys do.

• California · Posts: 8 · Member

I think I am running into a similar problem myself, where lower tranches have lower Ti/Tv ratios. Given that for me this issue occurs within the range of default tranches, it might be an issue with the callset (?), but I wanted to provide my figures in case it helps, or in case the GATK team has other ideas.
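The suggestion above to work directly from the tables VariantRecalibrator writes can be done without touching the R script: the .tranches file is plain CSV with '#' comment lines. The sketch below is hypothetical, not GATK code: the embedded excerpt is made up, the column names vary between GATK versions (check the header of your own file), and it assumes the `numNovel` counts are cumulative per tranche, so differencing adjacent tranches gives the count in each slice.

```python
import csv
import io

# Hypothetical excerpt of a VariantRecalibrator .tranches file.
# Real files have more columns and version-dependent names.
TRANCHES = """\
# Variant quality score tranches file
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv
80.00,40000,8000,2.30,2.10
85.00,45000,10000,2.25,1.95
90.00,52000,14000,2.20,1.80
99.00,70000,30000,2.10,1.40
"""

def per_tranche_counts(text):
    """Turn (assumed) cumulative numNovel totals into marginal
    counts per tranche slice, sorted by target sensitivity."""
    rows = list(csv.DictReader(
        line for line in io.StringIO(text) if not line.startswith("#")))
    rows.sort(key=lambda r: float(r["targetTruthSensitivity"]))
    out, prev = [], 0
    for r in rows:
        total = int(r["numNovel"])
        out.append((float(r["targetTruthSensitivity"]), total - prev))
        prev = total
    return out

for sens, novel in per_tranche_counts(TRANCHES):
    print(f"tranche {sens:.1f}: {novel} novel variants in this slice")
```

This gives the per-tranche numbers directly, sidestepping the cumulative TP/FP bars in the plot entirely.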
Attached also is the log file. Command line:

    java -Xmx20g -jar GenomeAnalysisTK.jar -nt 8 -T VariantRecalibrator \
        -R $SCRATCH/ucsc.hg19.fasta \
        -input $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_Genotyped_1000bp_capture_Gvcf.vcf \
        -recalFile $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_1000bp_4.recal \
        -tranchesFile $SCRATCH/ihg-client.ucsf.edu/seielstadm/Measles_1000bp_4.tranches \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/hapmap_3.3.hg19.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/1000G_omni2.5.hg19.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/1000G_phase1.snps.high_confidence.hg19.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $SCRATCH/ihg-client.ucsf.edu/seielstadm/dbsnp_137.hg19.vcf \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an InbreedingCoeff \
        -mode SNP \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 -tranche 85.0 \
        -rscriptFile recalibrate_SNP_plots2_1000bp_4.R

Default tranches:

Additional tranches:

• Posts: 6,418 · Administrator, GATK Developer, admin

This does sound like it's caused by the dataset deviating from the expected Ti/Tv profile. In such a case I think you do need to look at the Ti/Tv distribution in your data more closely with other tools, since our script is not built to handle such cases. In the future we'll see if we can make the script more flexible, but for now I'm afraid we can't devote resources to this. Sorry I don't have anything more useful for you... Good luck!
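For a first look at the Ti/Tv distribution outside the GATK tooling, the definition itself is simple: transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) substitutions; any other single-base substitution is a transversion. Below is a standalone sketch, not GATK code; the toy VCF lines in the usage function are made up for illustration and it only handles biallelic SNP records.

```python
# A<->G and C<->T are transitions; all other single-base changes
# between {A,C,G,T} are transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(records):
    """records: iterable of (ref, alt) tuples for biallelic SNPs."""
    ti = sum(1 for r, a in records if (r, a) in TRANSITIONS)
    tv = sum(1 for r, a in records
             if len(r) == len(a) == 1 and (r, a) not in TRANSITIONS)
    return ti / tv if tv else float("inf")

def titv_from_vcf_lines(lines):
    """Crude Ti/Tv over raw VCF text lines; skips headers and
    anything that is not a single-base REF/ALT pair."""
    pairs = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        ref, alt = fields[3], fields[4]
        if len(ref) == 1 and len(alt) == 1:
            pairs.append((ref, alt))
    return titv_ratio(pairs)
```

Plotting this ratio across VQSLOD bins (rather than only at the tranche cutoffs) would show whether the low-tranche dip is a real property of the callset or an artifact of the tranche summary.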

Geraldine Van der Auwera, PhD
