We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

VQSR: low TiTv

Hi

I'm trying out VQSR on a batch of 16 human whole genomes (~25-30x). I was wondering if someone could review the below profiles. It seems the false-positive rate is much higher than the GATK examples.

Has anyone else experienced similar results? Any possible solutions?

Here are the commands used with GATK-3.7.0:

#Build the SNP recalibration model /share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \ -input "$seqId"_variants.vcf \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /state/partition1/db/human/gatk/2.8/b37/hapmap_3.3.b37.vcf \ -resource:omni,known=false,training=true,truth=true,prior=12.0 /state/partition1/db/human/gatk/2.8/b37/1000G_omni2.5.b37.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 /state/partition1/db/human/gatk/2.8/b37/1000G_phase1.snps.high_confidence.b37.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /state/partition1/db/human/gatk/2.8/b37/dbsnp_138.b37.vcf \ -an DP \ -an QD \ -an FS \ -an SOR \ -an MQ \ -an MQRankSum \ -an ReadPosRankSum \ -an InbreedingCoeff \ -mode SNP \ -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ -recalFile "$seqId"_SNP.recal \ -tranchesFile "$seqId"_SNP.tranches \ -rscriptFile "$seqId"_SNP_plots.R \ -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \ -nt 12 \ -ped "$seqId"_pedigree.ped \ -dt NONE

#Apply the desired level of recalibration to the SNPs in the call set /share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \ -input "$seqId"_variants.vcf \ -mode SNP \ --ts_filter_level 99.0 \ -recalFile "$seqId"_SNP.recal \ -tranchesFile "$seqId"_SNP.tranches \ -o "$seqId"_recalibrated_snps_raw_indels.vcf \ -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \ -nt 12 \ -ped "$seqId"_pedigree.ped \ -dt NONE

#Build the Indel recalibration model /share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \ -input "$seqId"_recalibrated_snps_raw_indels.vcf \ -resource:mills,known=false,training=true,truth=true,prior=12.0 /state/partition1/db/human/gatk/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /state/partition1/db/human/gatk/2.8/b37/dbsnp_138.b37.vcf \ -an DP \ -an QD \ -an FS \ -an SOR \ -an MQRankSum \ -an ReadPosRankSum \ -an InbreedingCoeff \ -mode INDEL \ -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ --maxGaussians 4 \ -recalFile "$seqId"_INDEL.recal \ -tranchesFile "$seqId"_INDEL.tranches \ -rscriptFile "$seqId"_INDEL_plots.R \ -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \ -nt 12 \ -ped "$seqId"_pedigree.ped \ -dt NONE

#Apply the desired level of recalibration to the Indels in the call set /share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \ -input "$seqId"_recalibrated_snps_raw_indels.vcf \ -mode INDEL \ --ts_filter_level 99.0 \ -recalFile "$seqId"_INDEL.recal \ -tranchesFile "$seqId"_INDEL.tranches \ -o "$seqId"_recalibrated_variants.vcf \ -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \ -nt 12 \ -ped "$seqId"_pedigree.ped \ -dt NONE

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    @CardiffBioinf
    Hi,

    I will respond soon. I need some time to look at your commands and figure out what could be causing this.

    -Sheila

  • CardiffBioinfCardiffBioinf CardiffMember

    Hi Sheila

    Thanks for your assistance. I re-ran from the beginning using GATK 3.8.0 and the dbsnp_138.b37.excluding_sites_after_129.vcf. It looks much better (attached) although I also noted the sequencing quality was not perfect. Interestingly I have a lot more variants than before,so probably a mistake on my part.

    Many thanks
    Matt

Sign In or Register to comment.