result of VQSR - negative number og novel variants

Hi, I have run VQSR on my SNP data and got negative number of novel variants. I attached the tranches file.

I am working on whole exome data and here are my commands.

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R IsaacIndex/genome.fa \
-input data_raw.vcf \
-recalFile data_snp.recal.vcf \
-tranchesFile vqsr_snp.tranches \
-nt 8 \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode SNP \

java -Xmx3g -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R IsaacIndex/genome.fa \
-input data_raw.vcf \
-tranchesFile vqsr_snp.tranches \
-recalFile data_snp.recal.vcf \
-o data_snp.vqsr.vcf \
--ts_filter_level 99.5 \
-mode SNP \

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R IsaacIndex/genome.fa \
-input data_raw.vcf \
-recalFile data_indel.recal.vcf \
-tranchesFile vqsr_indel.tranches \
-nt 8 \
--maxGaussians 4 \
-resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf \
-an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
-mode INDEL \

java -Xmx3g -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R IsaacIndex/genome.fa \
-input data_raw.vcf \
-tranchesFile vqsr_indel.tranches \
-recalFile data_indel.recal.vcf \
-o data_indel.vqsr.vcf \
--ts_filter_level 99.0 \
-mode INDEL \

Tagged:

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    How was the vcf of variants created? Was it with HaplotypeCaller or something else?

  • Kelly135Kelly135 KoreaMember

    When using "dbsnp_138.hg19.excluding_sites_after_129.vcf" instead, number of variants became positive values, but still it looks strange. I attached the results again.

    And Geraldine, these vcfs are created using HaplotypeCaller and here are my commands. Sorry, it's too long!

    java -jar picard-tools-1.96/SortSam.jar SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT INPUT=sample1.sam OUTPUT=sample1.bam TMP_DIR=tmp

    java -jar picard-tools-1.96/MarkDuplicates.jar REMOVE_DUPLICATES=false CREATE_MD5_FILE=true VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true I=sample1.bam O=sample1.dup.bam METRICS_FILE=sample1.dup.metrics

    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R IsaacIndex/genome.fa -known Mills_Devine_2hit.indels.hg19.vcf -I sample1.dup.bam -o sample1.realign.intervals

    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R IsaacIndex/genome.fa -targetIntervals sample1.realign.intervals -I sample1.dup.bam -o sample1.realndup.bam

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R IsaacIndex/genome.fa -knownSites dbsnp_132.hg19.vcf -knownSites Mills_Devine_2hit.indels.hg19.vcf -I sample1.realndup.bam -o sample1.recal.table

    java -jar GenomeAnalysisTK.jar -T PrintReads -R IsaacIndex/genome.fa -I sample1.realndup.bam -BQSR sample1.recal.table -o sample1.recalrealndup.bam

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R IsaacIndex/genome.fa -knownSites Mills_Devine_2hit.indels.hg19.vcf -I sample1.recalrealndup.bam -knownSites dbsnp_132.hg19.vcf -BQSR sample1.recal.table -o sample1.recalafter.table

    java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R IsaacIndex/genome.fa -before sample1.recal.table -after sample1.recalafter.table -csv sample1.recalafter.csv -plots sample1.recalplots.pdf -l DEBUG

    java -jar picard-tools-1.96/CollectInsertSizeMetrics.jar VALIDATION_STRINGENCY=SILENT REFERENCE_SEQUENCE=IsaacIndex/genome.fa I=sample1.recalrealndup.bam O=sample1.recalrealndup.insertSize.tsv HISTOGRAM_FILE=sample1.recalrealndup.insertSize.histo.pdf METRIC_ACCUMULATION_LEVEL=LIBRARY

    java -jar picard-tools-1.96/CollectAlignmentSummaryMetrics.jar VALIDATION_STRINGENCY=SILENT REFERENCE_SEQUENCE=IsaacIndex/genome.fa I=sample1.recalrealndup.bam O=sample1.recalrealndup.metric.alignment.tsv METRIC_ACCUMULATION_LEVEL=LIBRARY

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R IsaacIndex/genome.fa -I sample1.recalrealndup.bam -o sample1.g.vcf -ERC GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -nct 2

    # I did the same process for other samples as above, and merged them into one vcf file.

    java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R IsaacIndex/genome.fa -nt 20 --variant sample1to150.list --dbsnp dbsnp_132.hg19.vcf -o data_raw.vcf

    java -jar GenomeAnalysisTK.jar -T SelectVariants -R IsaacIndex/genome.fa -V data_raw.vcf -o raw_snp.vcf -selectType SNP

    java -jar GenomeAnalysisTK.jar -T SelectVariants -R IsaacIndex/genome.fa -V data_raw.vcf -o raw_indel.vcf -selectType INDEL

    Then VQSR as I mentioned in the first post.

  • Kelly135Kelly135 KoreaMember

    I just found that processing whole exome data requires restricting target regions, which I didn't. So I believe it could be the cause of high false-positives. I will re-analyze the data again. Still, if you have any other opinion that I should refer to, please let me know.
    Thanks for your help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    When you have your new callset (with intervals restricted) you should change your VQSR workflow a bit to avoid separating out SNPs and indels. See the workshop VQSR presentation slide decks for a detailed explanation of how you should set it up.

Sign In or Register to comment.