We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Tranches plot completely off

Dear GATK team. First of all, thanks a lot for your continuous support and collaboration.

I just finished the joint genotyping of a set of ~3000 samples and got very disappointed when I took a look to the tranches plot, which is showing a vast majority of false positives in all tranches. Do you know where this may be coming from? Please find attached the SNP and INDELS plot as well as the mentioned tranches plot.

Best Answers

Answers

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    @simonsanchezj
    Hi,

    These plots do look off. Can you tell me what types of samples you are working with? Also, please post the exact command lines you used for each tool starting with Haplotype Caller.

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ack, plots have disappeared -- this is a bug of the forum software I think. Can you please repost them?

    Make sure you used the *_129 version of dsnp that we provide in our resource bundle as a known sites file. There are some hard coded expectations in the plotting script that affect the plotting results (but not the recalibration results). We'll document this more fully in the near future.

  • simonsanchezjsimonsanchezj GermanyMember

    Dear both. Please find attached the plots again. As for the commands used for the genotyping and recalibration, please see below. Also, I was using dbSNP138. Will retry with 139 as mentioned.
    Thanks a lot

    ----------------------------------------------

    Perform joint genotyping using GenotypeGVCFs

    ----------------------------------------------

    echo "date: Run GATK's GenotypeGVCFs to perform the joint genotyping"
    module load java/1.7.0
    ls "$VCF_FILES"*.raw.snps.indels.g.vcf > "$VCF_FILES"gVCFs.list
    srun java -Xmx"$MEM"g -jar "$GATK" \
    -T GenotypeGVCFs \
    -nt "$PROC" \
    -R "$REFERENCE" \
    --variant "$VCF_FILES"gVCFs.list \
    --dbsnp "$DBSNP" \
    -o "$VCF_FILES"All_samples_raw.snps.indels.vcf

    wait
    echo "date: GATK -> Joint genotyping done. A single .vcf file has been created for all samples."

    --------------------------------

    Variant recalibration for SNPs

    --------------------------------

    echo "date: Run VSQR to perform variant filtering for SNPs"
    module load java/1.7.0
    srun java -Xmx"$MEM"g -jar "$GATK" \
    -T VariantRecalibrator \
    -nt "$PROC" \
    -R "$REFERENCE" \
    -input "$VCF_FILES"All_samples_raw.snps.indels.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 "$HAPMAP" \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 "$OMNI" \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 "$SNP_1KG" \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 "$DBSNP" \
    -recalFile "$VCF_FILES"All_samples_recalibrate_SNP.recal \
    -tranchesFile "$VCF_FILES"All_samples_recalibrate_SNP.tranches \
    -rscriptFile "$VCF_FILES"All_samples_recalibrate_SNP.plots.R \
    -an QD \
    -an MQ \
    -an FS \
    -an MQRankSum \
    -an ReadPosRankSum \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0

    wait
    echo "date: GATK -> SNP recalibration modeling finished."

    ----------------------------------------------------------------------

    Apply the desired level of recalibration to the SNPs in the call set

    ----------------------------------------------------------------------

    echo "date: Apply recalibration to SNPs"
    module load java/1.7.0
    srun java -Xmx"$MEM"g -jar "$GATK" \
    -T ApplyRecalibration \
    -R "$REFERENCE" \
    -input "$VCF_FILES"All_samples_raw.snps.indels.vcf \
    -mode SNP \
    --ts_filter_level "$VQSLOD" \
    -recalFile "$VCF_FILES"All_samples_recalibrate_SNP.recal \
    -tranchesFile "$VCF_FILES"All_samples_recalibrate_SNP.tranches \
    -o "$VCF_FILES"All_samples_recalibrated_snps_raw_indels.vcf

    wait
    echo "date: GATK -> SNPs recalibrated succesfully."

    ----------------------------------

    Variant recalibration for indels

    ----------------------------------

    echo "date: Run VSQR to perform variant filtering for indels"
    module load java/1.7.0
    srun java -Xmx"$MEM"g -jar "$GATK" \
    -T VariantRecalibrator \
    -nt "$PROC" \
    -R "$REFERENCE" \
    -input "$VCF_FILES"All_samples_recalibrated_snps_raw_indels.vcf \
    --maxGaussians 4 \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 "$INDEL_MILLS" \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 "$DBSNP" \
    -an QD \
    -an FS \
    -an MQRankSum \
    -an ReadPosRankSum \
    -mode INDEL \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile "$VCF_FILES"All_samples_recalibrate_INDEL.recal \
    -tranchesFile "$VCF_FILES"All_samples_recalibrate_INDEL.tranches \
    -rscriptFile "$VCF_FILES"All_samples_recalibrate_INDEL.plots.R

    wait
    echo "date: GATK -> Indel recalibration modeling finished."

    ------------------------------------------------------------------------

    Apply the desired level of recalibration to the indels in the call set

    ------------------------------------------------------------------------

    echo "date: Apply recalibration to indels"
    module load java/1.7.0
    srun java -Xmx"$MEM"g -jar "$GATK" \
    -T ApplyRecalibration \
    -R "$REFERENCE" \
    -input "$VCF_FILES"All_samples_recalibrated_snps_raw_indels.vcf \
    -mode INDEL \
    --ts_filter_level "$VQSLOD" \
    -recalFile "$VCF_FILES"All_samples_recalibrate_INDEL.recal \
    -tranchesFile "$VCF_FILES"All_samples_recalibrate_INDEL.tranches \
    -o "$VCF_FILES"All_samples_recalibrated_variants.vcf

    wait
    echo "date: GATK -> Indels recalibrated succesfully."

    -----------------------------

    Perform read-backed phasing

    -----------------------------

    if [ "$READBACK" = "yes" ]
    then
    echo date: "Use GATK's ReadBackedPhasing to phase SNVs based on read information from the .bam files"
    ls BAM_files/hg19/*.bam > "$VCF_FILES"BAMs.list
    module load java/1.7.0
    srun java -Xmx"$MEM"g -jar "$GATK" \
    -T ReadBackedPhasing \
    -R "$REFERENCE" \
    -I "$VCF_FILES"BAMs.list \
    --variant "$VCF_FILES"All_samples_recalibrated_variants.vcf \
    -o "$VCF_FILES"All_samples_recalibrated_variants_phased.vcf \
    --phaseQualityThresh 20.0

    echo "`date`:  GATK -> SNVs variants succesfully phased. Please note that ReadBackedPhasing does not support insertions, deletions, or multi-nucleotide polymorphisms."
    

    else
    echo "date: Back phasing will be skipped."
    fi

    ---------------------------

    Delete intermediate files

    ---------------------------

    if [ -f "$VCF_FILES"All_samples_recalibrated_variants.vcf ]
    then
    rm "$VCF_FILES"All_samples_recalibrate_SNP.recal
    rm "$VCF_FILES"All_samples_recalibrate_SNP.recal.idx
    rm "$VCF_FILES"All_samples_recalibrate_INDEL.recal
    rm "$VCF_FILES"All_samples_recalibrate_INDEL.recal.idx
    rm "$VCF_FILES"All_samples_raw.snps.indels.vcf
    rm "$VCF_FILES"All_samples_raw.snps.indels.vcf.idx
    rm "$VCF_FILES"All_samples_recalibrated_snps_raw_indels.vcf
    rm "$VCF_FILES"All_samples_recalibrated_snps_raw_indels.vcf.idx
    rm countreads.txt
    rm "$VCF_FILES"BAMs.list
    rm "$VCF_FILES"gVCFs.list
    echo "date: File All_samples_recalibrated_variants.vcf found. All intermediate files deleted"
    echo "date: Pipeline completed. A raw gvcf file has been created for each sample. Recalibrated variants written to All_samples_recalibrated_variants.vcf."
    else
    echo "All_samples_recalibrated_variants.vcf not found"
    fi

  • simonsanchezjsimonsanchezj GermanyMember

    By the way, I cannot find dbSNP_139 in the resource bundle.

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Look for 129, not 139.

  • simonsanchezjsimonsanchezj GermanyMember

    Hello,

    In 2.8 bundle, I can only find dbsnp_138.hg19 and dbsnp_138.hg19.excluding_sites_after_129. In the mentioned pipeline I used the first one. Do you recommend me using the second one? (i.e aith all SNPs after 129 removed)?

    Thanks a lot.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes, you want dbsnp_138.hg19.excluding_sites_after_129. Sorry for the lack of clarity.

  • simonsanchezjsimonsanchezj GermanyMember

    Thanks a lot. Will try this out and will let you know.

  • simonsanchezjsimonsanchezj GermanyMember

    Hello, I tried again with the recommended SNP set and got the attached plots. Much better than the previous ones but still not as god as expected. Are these acceptable?

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @simonsanchezj, sorry for the late reply. Those plots look much better! You still seem to have a lot of FPs (and some of your model plots are a little messy) but there's one more thing The tool runs by default with a an expected titv of 2.1 iirc, which is the setting for whole genome. If this is exome data you're running on, you should set the -titv argument explicitly to something higher (typically between 2.5 and 3 depending on your capture kit). Please try that and let me know if it improves the results any further.

  • simonsanchezjsimonsanchezj GermanyMember

    Thanks for your answer, Gerlaldine. I used several capture kits from this project, namely, NImblegen V4 and Agilent V5. What -titv should I use then? Also, to which GATK walkers should I apply this option?

    Thanks a lot

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @simonsanchezj I don't know the specific values for those kits, but they're pretty standard exomes iirc, so you can try running VariantRecalibrator with -titv 2.8.

  • simonsanchezjsimonsanchezj GermanyMember

    Dear Geraldine,

    I ran it with -titv 2.8 and got the plots attached. Also, the number of resulting variants is much bigger than before (from 346064 to 732595). Is this something expected? In the GATK manual it is mentioned that the -titv is only for plotting purposes, but this is clearly changing my output. Is this expected?

    Again, thanks a lot for your help

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    No, that's definitely not expected. Can you clarify what you mean by "the number of resulting variants"? Do you mean variants passing filters, or number of variants in the file? The total number of variants in the file should not change during recalibration. The number of variants that are considered novel may change if you use a different version of dbsnp, so make sure you use the same version between runs that you compare.

  • simonsanchezjsimonsanchezj GermanyMember

    Hello Geraldine,

    Sorry for the confusion. I double checked and found a mistake in my script. Anyway, I ran my analysis without -titv argument, and with 2.5 and 2.8. As you pointed there is only a small difference in the number of PASSed variants. HOwever, when looking at the tranches plot, it looks like lower -titv numbers yield better results (lees number of false positives). Does this make sense? See attached.

  • simonsanchezjsimonsanchezj GermanyMember

    Hi Geraldine,

    A large subset of our samples comes from old (low coverage) exome data, so that would be the cause. Would you consider these values as acceptable?

    Thanks

Sign In or Register to comment.