
Tranches plot issue

Hi Geraldine,
I think this question has been asked before, but I cannot find a way to fix the problem. I have just run the VariantRecalibrator tool and I'm getting the attached tranches plot for SNPs. I believe I understand the tranches plot in the Best Practices documentation, but I don't know what is going on here. I know that the tranches are sorted by Ti/Tv, but to me this plot makes no sense.

This is my command:
java -Xmx4g -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator -R human_g1k_v37.fasta \
-input all.joinGeno.raw.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile recalibrate_SNP.recal \
-tranchesFile recalibrate_SNP.tranches \
-rscriptFile recalibrate_SNP_plots.R

Any help would be appreciated.

Best Answer

Answers

  • Irantzu Member
    edited February 2015

    Thanks Geraldine. So, in conclusion, the command I have used is OK, isn't it? In the documentation I have read: "This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set...", "We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf)". I don't need to provide another truth data set as input, do I?
    I also don't understand the meaning of the Ti/Tv parameter very well... maybe I've missed something in the manual. Could you please guide me?

    The last thing: if the command is OK and everything else is OK, then the problem is that I have an excess of false positives. In that case, which tranche should I consider the best one? I was planning to use the 99.9% tranche as recommended, but now I'm not sure.
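
    For context, this is roughly how I was planning to apply whichever tranche I end up choosing, using ApplyRecalibration (just a sketch; the output file name is a placeholder):

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration -R human_g1k_v37.fasta \
    -input all.joinGeno.raw.vcf \
    -mode SNP \
    --ts_filter_level 99.9 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o all.joinGeno.recalibrated_SNP.vcf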

    Thanks a lot.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Your basic command looks fine, but it's likely that there is something wrong with your data. You will need to do some quality control analysis to figure out what is wrong (including checking the distributions of various metrics per sample, per cohort, etc.). Unfortunately this is not something we provide recommendations for. It is not easy, so if you don't know where to start, you should seek help from a more experienced analyst. Good luck!
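
    As a very rough starting point only (a sketch, not an official recommendation; GATK 3 syntax assumed, and the output file name is just a placeholder), VariantEval can report per-sample call counts and Ti/Tv against dbsnp:

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T VariantEval -R human_g1k_v37.fasta \
    -eval all.joinGeno.raw.vcf \
    -D dbsnp_138.b37.vcf \
    -ST Sample \
    -o all.joinGeno.raw.eval.grp

    Samples whose Ti/Tv or novel call counts stand out from the rest of the cohort are the first ones to investigate.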

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Quick follow-up on this: we've realized that these effed-up tranche plots are caused by using a recent version of dbsnp (anything post-1000G project). This happens because the VQSR plotting routines use hard-coded expectations about novel vs known TiTv and variant counts that were formulated before the 1000G results were added to dbsnp. It only affects the plots, so the underlying data may be perfectly OK even if the tranche plots look terrible. If you redo the VQSR steps with the older version of dbsnp that we provide in our resource bundle, the plotting problem should go away.
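
    In Irantzu's command above that would just mean swapping the dbsnp resource line for the pre-1000G file from the bundle, something like this (a sketch; the exact file name depends on the bundle version you downloaded):

    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.excluding_sites_after_129.vcf \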

  • alegia Italy Member

    Hi all,
    I have a similar issue with variant recalibration in my exome callset. Basically, the ranking of Ti/Tv ratios does not reflect the expected order of SNP tranches against my truth set (hapmap3). However, I think my plot (attached) probably looks better than the one posted by Irantzu, since only the 90 tranche is out of the expected order, with a Ti/Tv ratio slightly below that of the 99 tranche. A few questions related to this:
    1. How is it possible that the VQSLOD threshold of the 90 tranche is slightly less conservative than that of the 99 tranche? I usually see a huge difference in the number of called variants between the 90 and 99 tranches (tranche 99 containing many more), but in this case I see roughly the same number of called variants... is this something that can happen?
    2. What should I do in this case? Be as worried as Irantzu about the soundness of my data and dig deeper into QC, or go ahead and select the most conservative tranche?
    3. The Ti/Tv ratios are generally lower than expected for a WES experiment, but not that low... could these values still be plausible if we consider that the experiment has good capture of flanking intronic sequences?

    Best,
    Alessandro
    (GATK version 3.5-0-g36282e4)

    java -Xmx8g -jar /bio/GenomeAnalysisTK.jar -T VariantRecalibrator -input ${1}/${1}.gvcf -R /data/shared/noPAR.hg19_decoy.fa \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/shared/sort.hapmap_3.3.hg19.vcf.gz \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 /data/shared/sort.1000G_omni2.5.hg19.vcf.gz \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/shared/sort.1000G_phase1.snps.high_confidence.hg19.vcf.gz \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /data/shared/sort.dbsnp_137.hg19.vcf.gz \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    --maxGaussians 6 --minNumBadVariants 4000 \
    -recalFile ${1}/${1}.SNP.recal \
    -tranchesFile ${1}/${1}.SNP.tranches \
    -rscriptFile ${1}/${1}_recalibrate_SNP_plots.R \
    &> ${1}/${1}.SNPrecalibration.log

  • alegiaalegia ItalyMember

    Hi again, to add something,
    I paste the content of the relevant .tranches file

    # Variant quality score tranches file

    Version number 5

    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,177961,2437,2.6228,2.2665,2.3263,VQSRTrancheSNP0.00to90.00,SNP,91619,82457,0.9000
    99.00,275332,86012,2.6247,2.3185,-0.0884,VQSRTrancheSNP90.00to99.00,SNP,91619,90702,0.9900
    99.90,308197,146126,2.5579,1.8563,-9.7050,VQSRTrancheSNP99.00to99.90,SNP,91619,91527,0.9990
    100.00,312328,161315,2.5423,1.7058,-22057.1743,VQSRTrancheSNP99.90to100.00,SNP,91619,91619,1.0000

    Why do I see fewer novel variants here than in the tranche plot?
    Maybe something did not work properly in the script that produces the plot? This looks similar to what @Geraldine_VdAuwera described in a previous post, but in this case I am using dbsnp only as a known callset... any suggestion?

    Thanks,
    Alessandro

  • Sheila Broad Institute Member, Broadie, Moderator

    @alegia
    Hi Alessandro,

    Can you tell us what is in your input VCF? How many samples and are they whole exome or whole genome?

    Can you try with the latest version as well? There have been some changes that may affect your results.

    Lastly, you can try setting truth=true for the omni resource, i.e. use -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/shared/sort.1000G_omni2.5.hg19.vcf.gz instead of -resource:omni,known=false,training=true,truth=false,prior=12.0 /data/shared/sort.1000G_omni2.5.hg19.vcf.gz

    -Sheila

  • alegia Italy Member

    Hi Sheila, it's 159 WES samples in a single .gvcf file (from joint genotyping).
    I will need some time to install the latest GATK version on our server due to system admin policy, but in the meantime I used the omni callset as truth (as you suggested), which gave me a regular tranche plot (see exa tranche plot file attached).

    Variant quality score tranches file

    Version number 5

    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,187590,3299,2.6410,2.3462,2.2796,VQSRTrancheSNP0.00to90.00,SNP,119128,107215,0.9000
    99.00,280861,96529,2.6305,2.3102,-0.1519,VQSRTrancheSNP90.00to99.00,SNP,119128,117936,0.9900
    99.90,310478,152136,2.5590,1.8309,-8.5073,VQSRTrancheSNP99.00to99.90,SNP,119128,119008,0.9990
    100.00,314548,166217,2.5444,1.6982,-33815.2690,VQSRTrancheSNP99.90to100.00,SNP,119128,119128,1.0000

    I also played a bit with which annotations to include in the recalibration and noticed that if I remove MQ and MQRankSum, the tranche plot (ALLgoodSamples.SNP.tranches.pdf, attached) looks regular, and the clustering of variants in the recalibration model is also notably improved (see the ALLgoodSamples_recalibrate_SNP_plots.R.pdf file).
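
    Concretely, the only change was to the annotation line of the command above, which became (sketch, everything else unchanged):

    -an QD -an ReadPosRankSum -an FS -an SOR \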

    Although I am a bit reluctant to remove mapping quality information from the model, I have read in other discussions that these statistics may sometimes be uninformative and act as confounders.
    Also, in a previous discussion (http://gatkforums.broadinstitute.org/gatk/discussion/6983/vqsr-error-no-mq-annotation-detected) @Geraldine_VdAuwera mentioned that removing MQ "will weaken the recalibration a bit but the results should still be valid".

    Do you think I could adopt this solution without reducing the quality of the recalibration too much? (I don't see any odd statistics in the downstream variant callset evaluation.)

    I am looking forward to hearing from you.

    Best,
    Alessandro

    Issue · Github: #1603, opened by Sheila, closed by vdauwera

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    @alegia If the MQ-related annotations are destabilizing the model then it is indeed better to take them out. There might be something in the distribution of values for those annotations in your data that violates the model's assumptions. So you can always try doing some QC to see what's up with mapping quality in your data. But the underlying assignment of mapping qualities by aligners is not calibrated in a truly meaningful way, and it's always been the most problem-prone statistic in the bunch, so it's fairly common to have to take it out. If the model is behaving well with the remaining annotations then the recommendation is to just roll with it.
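
    As a very rough first look (just a sketch; the BAM file name is a placeholder), you can tabulate the aligner's MAPQ distribution for a sample straight from its BAM with samtools:

    samtools view sample1.bam | cut -f5 | sort -n | uniq -c

    If a large fraction of reads sits at low or intermediate MAPQ, or the distributions differ a lot between samples, that would be consistent with the MQ-based annotations destabilizing the model.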
