Attention:
The frontline support team will be slow on the forum because we are occupied with a GATK Workshop on March 26th and 27th 2019. We will be back and available to answer questions on the forum on March 28th 2019.

ApplyRecalibration error, during snps recalibration

manolismanolis Member ✭✭
edited April 2018 in Ask the GATK team

GATK 4.0.3.0

Hi,

I have a problem with ApplyRecalibration:

A USER ERROR has occurred: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC Unknown @ chr1:12899 Q75.51 of type=SNP alleles=[A*, C] attr={AC=2, AF=0.040, AN=50, DP=66, ExcessHet=0.0445, FS=0.000, InbreedingCoeff=0.1386, MLEAC=3, MLEAF=0.060, MQ=21.65, QD=25.17, SOR=2.833} GT=[] filters=

Here my pipeline:

1A) "Hard Filtering"

input="0.raw.vcf"
output="1.HF_raw.vcf"

${ph6} --java-options -Xmx25g VariantFiltration --filter-expression 'ExcessHet > 54.69' --filter-name 'ExcessHet' -V ${input} -O ${output}

1A) "Make Sites Only Vcf"

input="1.HF_raw.vcf"
output="2.HFso_raw.vcf"

${ph6} --java-options -Xmx25g MakeSitesOnlyVcf -I ${input} -O ${output}

input="2.HFso_raw.vcf"
output="HFso_indel.recal.vcf"
tranches="indels.IVR.tranches"

2A) "Indels Variant Recalibrator"

${ph6} --java-options -Xmx100g VariantRecalibrator -V ${input} -O ${output} --tranches-file ${tranches} -trust-all-polymorphic -tranche "100.0" -tranche "99.95" -tranche "99.9" -tranche "99.5" -tranche "99.0" -tranche "97.0" -tranche "96.0" -tranche "95.0" -tranche "94.0" -tranche "93.5" -tranche "93.0" -tranche "92.0" -tranche "91.0" -tranche "90.0" -an "FS" -an "ReadPosRankSum" -an "MQRankSum" -an "QD" -an "SOR" -an "DP" -mode INDEL --max-gaussians 4 -resource mills,known=false,training=true,truth=true,prior=12:${sorgsorg}/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -resource axiomPoly,known=false,training=true,truth=false,prior=10:${sorgsorg}/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=2:${sorgsorg}/Homo_sapiens_assembly38.dbsnp138.vcf

2B) "SNPs Variant Recalibrator Create Model"

input="2.HFso_raw.vcf"
output="HFso_snp.recal.vcf"
tranches="snps.SVRM.tranches"
model="snps_model.SVRM.report"

${ph6} --java-options -Xmx125g VariantRecalibrator -V ${input} -O ${output} --tranches-file ${tranches} -trust-all-polymorphic -tranche "100.0" -tranche "99.95" -tranche "99.9" -tranche "99.8" -tranche "99.6" -tranche "99.5" -tranche "99.4" -tranche "99.3" -tranche "99.0" -tranche "98.0" -tranche "97.0" -tranche "90.0" -an "QD" -an "MQRankSum" -an "ReadPosRankSum" -an "FS" -an "MQ" -an "SOR" -an "DP" -mode SNP -sample-every "10" --output-model ${model} --max-gaussians 6 -resource hapmap,known=false,training=true,truth=true,prior=15:${sorgsorg}/hapmap_3.3.hg38.vcf.gz -resource omni,known=false,training=true,truth=true,prior=12:${sorgsorg}/1000G_omni2.5.hg38.vcf.gz -resource 1000G,known=false,training=true,truth=false,prior=10:${sorgsorg}/1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=7:${sorgsorg}/Homo_sapiens_assembly38.dbsnp138.vcf

2C) "SNPsVariantRecalibrator"

input="2.HFso_raw.vcf"
output="provaoutput.vcf"
tranches="snps.SVR.tranches"
model="snps_model.SVR.report"

${ph6} --java-options -Xmx50g VariantRecalibrator -V ${input} -O ${output} --tranches-file ${tranches} -trust-all-polymorphic -tranche "100.0" -tranche "99.95" -tranche "99.9" -tranche "99.8" -tranche "99.6" -tranche "99.5" -tranche "99.4" -tranche "99.3" -tranche "99.0" -tranche "98.0" -tranche "97.0" -tranche "90.0" -an "QD" -an "MQRankSum" -an "ReadPosRankSum" -an "FS" -an "MQ" -an "SOR" -an "DP" -mode SNP --input-model ${model} --max-gaussians 6 -resource hapmap,known=false,training=true,truth=true,prior=15:${sorgsorg}/hapmap_3.3.hg38.vcf.gz -resource omni,known=false,training=true,truth=true,prior=12:${sorgsorg}/1000G_omni2.5.hg38.vcf.gz -resource 1000G,known=false,training=true,truth=false,prior=10:${sorgsorg}/1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=7:${sorgsorg}/Homo_sapiens_assembly38.dbsnp138.vcf

2D)"ApplyRecalibration"

input="2.HFso_raw.vcf"
outin="tmp.indel.recalibrated.vcf"
output="2.HFso_recalibrated.vcf"
indelrecal="HFso_indel.recal.vcf"
snprecal="HFso_snp.recal.vcf"
indeltranches="indels.IVR.tranches"
snptranches="snps.SVR.tranches"

${ph6} --java-options -Xmx50g ApplyVQSR -O ${outin} -V ${input} --recal-file ${indelrecal} -tranches-file ${indeltranches} -truth-sensitivity-filter-level "99.7" --create-output-variant-index true -mode INDEL

${ph6} --java-options -Xmx50g ApplyVQSR -O ${output} -V ${outin} --recal-file ${snprecal} -tranches-file ${snptranches} -truth-sensitivity-filter-level "99.7" --create-output-variant-index true -mode SNP

Any suggestion?

In the forum I found only old conversations, no about GATK4, if I'm correct.

Many thanks

Issue · Github
by Sheila

Issue Number
3066
State
closed
Last Updated
Assignee
Array
Closed By
chandrans

Best Answers

  • manolismanolis ✭✭
    edited April 2018 Accepted Answer

    About the ApplyVQSR step

    in the input file "2.HFso_raw.vcf" (it is the output file of the "Make Sites Only Vcf" step) are present 7,136,488 records (header + variants).

    chr1:12899 is the second variant of the file:

    FIRST VARIANT
    chr1 10492 rs55998931 C T 91.72 PASS AC=3;AF=0.107;AN=28;BaseQRankSum=-9.670e-01;ClippingRankSum=0.00;DB;DP=35;ExcessHet=0.2482;FS=0.000;InbreedingCoeff=0.1983;MLEAC=8;MLEAF=0.286;MQ=43.87;MQRankSum=0.967;QD=18.34;ReadPosRankSum=0.967;SOR=0.223

    SECOND VARIANT
    chr1 12899 . A C 75.51 PASS AC=2;AF=0.040;AN=50;DP=66;ExcessHet=0.0445;FS=0.000;InbreedingCoeff=0.1386;MLEAC=3;MLEAF=0.060;MQ=21.65;QD=25.17;SOR=2.833

    In the output file are present 3426 lines (the header) + only one variant (the first variant of the input file)

    chr1 10492 rs55998931 C T 91.72 VQSRTrancheSNP99.90to99.95 AC=3;AF=0.107;AN=28;BaseQRankSum=-9.670e-01;ClippingRankSum=0.00;DB;DP=35;ExcessHet=0.2482;FS=0.000;InbreedingCoeff=0.1983;MLEAC=8;MLEAF=0.286;MQ=43.87;MQRankSum=0.967;NEGATIVE_TRAIN_SITE;QD=18.34;ReadPosRankSum=0.967;SOR=0.223;VQSLOD=-4.602e+00;culprit=MQRankSum

    No other data are present in the output vcf.

    Many thanks

  • manolismanolis ✭✭
    edited June 2018 Accepted Answer

    Hi Laura,

    thank you for your answers! I will let you know.

    All the best

  • manolismanolis ✭✭
    Accepted Answer

    Hi, now seems that everything is ok with VQSR-GATK4 in my pipe!

    Thank you very much Laura and @Sheila

Answers

  • manolismanolis Member ✭✭
    edited April 2018 Accepted Answer

    About the ApplyVQSR step

    in the input file "2.HFso_raw.vcf" (it is the output file of the "Make Sites Only Vcf" step) are present 7,136,488 records (header + variants).

    chr1:12899 is the second variant of the file:

    FIRST VARIANT
    chr1 10492 rs55998931 C T 91.72 PASS AC=3;AF=0.107;AN=28;BaseQRankSum=-9.670e-01;ClippingRankSum=0.00;DB;DP=35;ExcessHet=0.2482;FS=0.000;InbreedingCoeff=0.1983;MLEAC=8;MLEAF=0.286;MQ=43.87;MQRankSum=0.967;QD=18.34;ReadPosRankSum=0.967;SOR=0.223

    SECOND VARIANT
    chr1 12899 . A C 75.51 PASS AC=2;AF=0.040;AN=50;DP=66;ExcessHet=0.0445;FS=0.000;InbreedingCoeff=0.1386;MLEAC=3;MLEAF=0.060;MQ=21.65;QD=25.17;SOR=2.833

    In the output file are present 3426 lines (the header) + only one variant (the first variant of the input file)

    chr1 10492 rs55998931 C T 91.72 VQSRTrancheSNP99.90to99.95 AC=3;AF=0.107;AN=28;BaseQRankSum=-9.670e-01;ClippingRankSum=0.00;DB;DP=35;ExcessHet=0.2482;FS=0.000;InbreedingCoeff=0.1983;MLEAC=8;MLEAF=0.286;MQ=43.87;MQRankSum=0.967;NEGATIVE_TRAIN_SITE;QD=18.34;ReadPosRankSum=0.967;SOR=0.223;VQSLOD=-4.602e+00;culprit=MQRankSum

    No other data are present in the output vcf.

    Many thanks

  • edited April 2018

    Hi Sheila,

    The issue occurring when running on SNPs.

    I ran:

    1) "> Hard Filtering" -> OK
    2) "> Make Sites Only Vcf" -> OK
    3) "> Indels Variant Recalibrator" -> OK
    4) "> Indels Apply Recalibration" -> OK
    5) "> SNPs Variant Recalibration" -> OK
    6) "> SNPs Apply Recalibration" -> ERROR

    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx50g -jar /share/apps/bio/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar ApplyVQSR -V tmp.recalibrated.vcf -O WES_hg38_ChrnoGChrChr_x53_recalibrated.vcf --recal-file WES_hg38_ChrnoGChrChr_x53_HFso_snp.recal.vcf -tranches-file VR_snps.tranches -truth-sensitivity-filter-level 99.7 --create-output-variant-index true -mode SNP
    13:08:09.802 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/apps/bio/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    13:08:10.023 INFO ApplyVQSR - ------------------------------------------------------------
    13:08:10.024 INFO ApplyVQSR - The Genome Analysis Toolkit (GATK) v4.0.3.0
    13:08:10.024 INFO ApplyVQSR - For support and documentation go to https://software.broadinstitute.org/gatk/
    13:08:10.025 INFO ApplyVQSR - Executing as [email protected] on Linux v3.5.0-36-generic amd64
    13:08:10.025 INFO ApplyVQSR - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_91-b14
    13:08:10.025 INFO ApplyVQSR - Start Date/Time: April 28, 2018 1:08:09 PM CEST
    13:08:10.026 INFO ApplyVQSR - ------------------------------------------------------------
    13:08:10.026 INFO ApplyVQSR - ------------------------------------------------------------
    13:08:10.027 INFO ApplyVQSR - HTSJDK Version: 2.14.3
    13:08:10.027 INFO ApplyVQSR - Picard Version: 2.17.2
    13:08:10.028 INFO ApplyVQSR - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    13:08:10.028 INFO ApplyVQSR - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    13:08:10.028 INFO ApplyVQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    13:08:10.028 INFO ApplyVQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    13:08:10.028 INFO ApplyVQSR - Deflater: IntelDeflater
    13:08:10.028 INFO ApplyVQSR - Inflater: IntelInflater
    13:08:10.029 INFO ApplyVQSR - GCS max retries/reopens: 20
    13:08:10.029 INFO ApplyVQSR - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    13:08:10.029 INFO ApplyVQSR - Initializing engine
    13:08:11.026 INFO FeatureManager - Using codec VCFCodec to read file file:///home/manolis/GATK4/IlluminaExomePairEnd/8.VQSR/WES_hg38_BQSRChrHCnoGGDBIChrGGVCFChr_x53/WES_hg38_ChrnoGChrChr_x53_HFso_snp.recal.vcf
    13:08:11.273 INFO FeatureManager - Using codec VCFCodec to read file file:///home/manolis/GATK4/IlluminaExomePairEnd/8.VQSR/WES_hg38_BQSRChrHCnoGGDBIChrGGVCFChr_x53/tmp.recalibrated.vcf
    13:08:11.924 INFO ApplyVQSR - Done initializing engine
    13:08:11.933 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=90.00 minVQSLod=3.2791 known=(359721 @ 2.1877) novel=(86323 @ 0.8847) truthSites(202681 accessible, 182412 called), name=VQSRTrancheSNP0.00to90.00]
    13:08:11.934 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=97.00 minVQSLod=0.7138 known=(411070 @ 2.1791) novel=(160275 @ 0.9057) truthSites(202681 accessible, 196600 called), name=VQSRTrancheSNP90.00to97.00]
    13:08:11.936 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=98.00 minVQSLod=-0.0625 known=(416491 @ 2.1784) novel=(166762 @ 0.9059) truthSites(202681 accessible, 198627 called), name=VQSRTrancheSNP97.00to98.00]
    13:08:11.937 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=99.00 minVQSLod=-1.4013 known=(423912 @ 2.1778) novel=(173295 @ 0.9051) truthSites(202681 accessible, 200654 called), name=VQSRTrancheSNP98.00to99.00]
    13:08:11.938 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=99.30 minVQSLod=-1.9042 known=(425903 @ 2.1774) novel=(175018 @ 0.9046) truthSites(202681 accessible, 201262 called), name=VQSRTrancheSNP99.00to99.30]
    13:08:11.938 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=99.40 minVQSLod=-2.0577 known=(426513 @ 2.1777) novel=(175535 @ 0.9044) truthSites(202681 accessible, 201464 called), name=VQSRTrancheSNP99.30to99.40]
    13:08:11.939 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=99.50 minVQSLod=-2.2347 known=(427220 @ 2.1776) novel=(176043 @ 0.9045) truthSites(202681 accessible, 201667 called), name=VQSRTrancheSNP99.40to99.50]
    13:08:11.939 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=99.60 minVQSLod=-2.4204 known=(427994 @ 2.1776) novel=(176627 @ 0.9045) truthSites(202681 accessible, 201870 called), name=VQSRTrancheSNP99.50to99.60]
    13:08:11.940 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=99.80 minVQSLod=-2.9787 known=(430079 @ 2.1770) novel=(178158 @ 0.9041) truthSites(202681 accessible, 202275 called), name=VQSRTrancheSNP99.60to99.80]
    13:08:11.941 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=99.90 minVQSLod=-3.9022 known=(432612 @ 2.1762) novel=(180804 @ 0.9039) truthSites(202681 accessible, 202478 called), name=VQSRTrancheSNP99.80to99.90]
    13:08:11.941 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=99.95 minVQSLod=-5.5068 known=(436103 @ 2.1738) novel=(188117 @ 0.9079) truthSites(202681 accessible, 202579 called), name=VQSRTrancheSNP99.90to99.95]
    13:08:11.943 INFO ApplyVQSR - Read tranche TruthSensitivityTranche targetTruthSensitivity=100.00 minVQSLod=-39253.0534 known=(439834 @ 2.1700) novel=(198400 @ 0.9109) truthSites(202681 accessible, 202681 called), name=VQSRTrancheSNP99.95to100.00]
    13:08:12.013 INFO ApplyVQSR - Keeping all variants in tranche TruthSensitivityTranche targetTruthSensitivity=99.80 minVQSLod=-2.9787 known=(430079 @ 2.1770) novel=(178158 @ 0.9041) truthSites(202681 accessible, 202275 called), name=VQSRTrancheSNP99.60to99.80]
    13:08:12.135 INFO ProgressMeter - Starting traversal
    13:08:12.136 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    13:08:12.211 INFO ApplyVQSR - Shutting down engine
    [April 28, 2018 1:08:12 PM CEST] org.broadinstitute.hellbender.tools.walkers.vqsr.ApplyVQSR done. Elapsed time: 0.04 minutes.
    Runtime.totalMemory()=2281701376


    A USER ERROR has occurred: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC Unknown @ chr1:12899 Q75.51 of type=SNP alleles=[A*, C] attr={AC=2, AF=0.040, AN=50, DP=66, ExcessHet=0.0445, FS=0.000, InbreedingCoeff=0.1386, MLEAC=3, MLEAF=0.060, MQ=21.65, QD=25.17, SOR=2.833} GT=[] filters=


    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

    If I also performed:

    1) "> Hard Filtering" -> OK
    2) "> Make Sites Only Vcf" -> OK
    3) "> SNPs Variant Recalibration" -> OK
    4) "> SNPs Apply Recalibration" -> ERROR

    Can I sent you the log file of all steps?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @manolis_emmanouil
    Hi,

    I am not sure what is going on. I need to check with the team and get back to you.

    -Sheila

  • manolismanolis Member ✭✭

    Hi @Sheila

    do you have any news? I started from 0 and I created a new file to use as input in the VQSR step...

    A USER ERROR has occurred: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC Unknown @ chr1:14653 Q111.46 of type=SNP alleles=[C*, T] attr={AC=2, AF=0.500, AN=4, BaseQRankSum=1.50, ClippingRankSum=0.00, DP=26, ExcessHet=4.7712, FS=8.368, MLEAC=2, MLEAF=0.500, MQ=35.25, MQRankSum=2.47, QD=4.29, ReadPosRankSum=-3.450e-01, SOR=2.833} GT=[] filters=
    

    Please, could you help me!

    • as above, with the indel steps I don't have any error!

    Many thanks

  • manolismanolis Member ✭✭
    edited May 2018

    Here all steps... please, could you check the code :neutral: ... probably I'm doing something wrong but I cann't find it!

            java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -jar gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar VariantFiltration --filter-expression ExcessHet > 54.69 --filter-name ExcessHet -O 7.VQSR/prova_rawHF.vcf -V 6.vcf/storage/PROVA_WESx2_Chr_noG_20180524/PROVA_WESx2_Chr_noG_20180524_raw.vcf
    
    
            java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx3g -jar gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar MakeSitesOnlyVcf -I 7.VQSR/prova_rawHF.vcf -O 7.VQSR/prova_rawHFSO.vcf
    
    
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx100g -jar gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar VariantRecalibrator -V 7.VQSR/prova_rawHFSO.vcf -O 7.VQSR/prova_rawHFSOindel.vcf --tranches-file indel_tranches.txt --trust-all-polymorphic -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 -an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP -mode INDEL --max-gaussians 4 -resource mills,known=false,training=true,truth=true,prior=12:Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -resource axiomPoly,known=false,training=true,truth=false,prior=10:Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=2:Homo_sapiens_assembly38.dbsnp138.vcf
    
    
          java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx125g -jar gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar VariantRecalibrator -V 7.VQSR/prova_rawHFSO.vcf -O 7.VQSR/prova_rawHFSOsnp.vcf --tranches-file snp.tranches -trust-all-polymorphic -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP -mode SNP -sample-every 10 --output-model snp.model --max-gaussians 6 -resource hapmap,known=false,training=true,truth=true,prior=15:hapmap_3.3.hg38.vcf.gz -resource omni,known=false,training=true,truth=true,prior=12:1000G_omni2.5.hg38.vcf.gz -resource 1000G,known=false,training=true,truth=false,prior=10:1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=7:Homo_sapiens_assembly38.dbsnp138.vcf
    
    
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx25g -jar gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar ApplyVQSR -O tmp.indel.recalibrated.vcf -V 7.VQSR/prova_rawHFSO.vcf --recal-file 7.VQSR/prova_rawHFSOindel.vcf --tranches-file indel_tranches.txt --truth-sensitivity-filter-level 99.7 --create-output-variant-index true -mode INDEL
    
    
      -jar gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar ApplyVQSR -O 7.VQSR/prova_rawHFSO_VQSR.vcf -V tmp.indel.recalibrated.vcf --recal-file 7.VQSR/prova_rawHFSOsnp.vcf --tranches-file snp.tranches --truth-sensitivity-filter-level 99.7 --create-output-variant-index true -mode SNP
    
  • manolismanolis Member ✭✭

    Please @Sheila,

    let me know! I have ready the whole GATK4-pipeline and the related samples outputs from each step, Missing only the last, final step !!!

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Hi @manolis ,

    This is a slightly different VQSR workflow than the classic version, which presumably you got from a slightly older version of the GATK joint calling WDL? That version was designed for a 20,000 genome callset so there are some fancy data gymnastics to keep the memory requirements in check. Specifically, the "SNPs Variant Recalibrator Create Model" step was added for this large callset. While the classic VQSR workflow creates a model for the data and applies it to the data to produce a .recal output, this workflow creates a model based on a subsample of the data (note the -sample-every "10" argument), serializes that model, and then applies the model to each scattered shard of data to create shards of .recal files. Then those .recal files are used in ApplyVQSR. This issue with your workflow is that you're supplying ApplyVQSR with the .recal file from the subsampled model creation, so it only contains every 10th variant. If you change snprecal to "provaoutput.vcf" (from the "SNPsVariantRecalibrator" step instead of "HFso_snp.recal.vcf" from the "SNPs Variant Recalibrator Create Model" step) then ApplyVQSR should work. Otherwise please let us know.

    We've since updated the joint calling WDLs to only use this model creation workflow for more than 10,000 samples (see line 216 of https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/JointGenotypingWf.wdl and line 260 of https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.wdl). The results are equally good for smaller cohorts (VariantRecalibrator downsamples internally to 2.5M variants anyway), but using the old workflow for cohorts that aren't enormous removes a lot of unnecessary complexity.

    Cheers,
    Laura

  • manolismanolis Member ✭✭

    Hi Laura,

    thank you very much for the explenatios! You were very clear about the new VQSR pipe, useful for very large cohorts! Where can I find the "old workflow"?

    Thank you very muth,
    Manolis

  • manolismanolis Member ✭✭

    I have to install gatk3 for using the classic/old VQSR pipe, or deactivate the -sample-every "n" option in the gatk4?.

  • manolismanolis Member ✭✭
    edited June 2018

    Laura, please, see also the code on 24 May. It was the last one that I was trying to use... What I have to change on it?

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    If you copy from the latest workflow from https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/JointGenotypingWf.wdl or https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.wdl then the script will default to the "old workflow" if you have fewer than 10,000 samples.

    Specifically, if you want to reimplement the WDL in your own script, you can use GATK4, but to do the classic VQSR pipeline your SNP VariantRecalibrator command has to change. You should take out -sample-every 10 --output-model snp.model so that it looks like the command from the WDL above for num_gvcfs < 10000. It looks like the output and input filenames already match up.

  • manolismanolis Member ✭✭
    edited June 2018 Accepted Answer

    Hi Laura,

    thank you for your answers! I will let you know.

    All the best

  • manolismanolis Member ✭✭
    Accepted Answer

    Hi, now seems that everything is ok with VQSR-GATK4 in my pipe!

    Thank you very much Laura and @Sheila

  • init_jsinit_js Member
    edited June 2018

    Thank you for the pointers @gauthier

    I need to scatter the VariantRecalibrator step across my data to have it complete in a timely fashion. I've got too much raw VCFs to go through in a single pass. The strategy I am shooting for is to fan-out VariantRecalibrator to obtain M different recal files, and M different tranche files, followed by GatherTranches.

    I'm not going to use --sample-every-Nth-variant, as it reduces the number of sites considered, I'd like to visit all sites.

    As far as I understand GatherTranches will combine the M tranches into a single gathered tranche file T, but will leave the M recal files split up.

    ApplyVQSR takes as input parameters both a recal file and a tranche file, and can be part of a scatter-gather. In the particular case of the pipeline linked above, It looks like each scattered call to ApplyVQSR receives the same unified tranches file T, but a different recal file. Is that correct? (https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/5585cdf7877104f2c61b2720ddfe7235f2fad577/JointGenotypingWf.wdl#L229 )

    I'm not familiar enough with this particular WDL pipeline to determine whether the two scatter operations (i.e. VariantRecalibrator, ApplyVQSR) can have a different degree of fanout, or if they have to be the same. They seem to iterate over a different range of indices. Generally speaking, Is it the case that if a VCF file F was used to produce a recal file R_f, R_f should also be used as a recal parameter when running ApplyVQSR on any input data that covers F (e.g. F.vcf, Gather(F.vcf + G.vcf), etc) ?

    Also -- the WDL pipeline uses scatter only for SNPs. Indels use the classic version. Any special reason for that?

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Hi @init_js ,

    I wouldn't recommend scattering VariantRecalibrator in that way because each call will build a different model and different shards of your data will effectively be filtered using different "rules". That's the reason my large genome callset workflow has one VariantRecalibrator call sampling 10% of the data. It builds one model that then gets applied to each shard in the second VariantRecalibrator call. (You're right that ApplyVQSR takes the gathered tranches but the sharded recals. We need to gather the tranches so we set a uniform threshold for filtering across the entire genome.)

    When you say that you have too much data in your raw VCFs to do a single pass, did you trim the VCFs to sites-only first? (You can use the Picard MakeSitesOnlyVcf tool or the --sites-only output argument we just ported to the GATK4 engine.) Even with 20,000 whole genomes, running VariantRecalibrator on the whole sites-only file was still reasonably tractable, but we did need to do that sampling to keep the memory requirements in check. (VariantRecalibrator downsamples internally anyway, but this way we don't need to hold all the variants in memory first.) For indels we didn't need to downsample because we could still load all the variants on a really beefy VM, but for the next, larger iteration of the gnomAD callset we'll probably have to do the same for indels.

    You are correct that R_f would be required to ApplyVQSR to any data containing F (to use your notation) -- you'll get errors like the one at the top of this thread if you don't. The recal file should have as many variants as the input (accounting for type SNP or indel) with the score for each. Those score are effectively ranking the variants by how confident we are in their correctness. In order to actually apply filters in the ApplyVQSR step, we need the variants to filter (F), the scores from the model (recal R_f) and the (gathered) tranches. If you want to run ApplyVQSR on Gather(F.vcf + G.vcf) then you'll probably also need to gather R_f and R_g because I think ApplyVQSR will only take one recal file.

    Also, if you're going to use GatherTranches make sure you use GATK version 4.0.1.1 or later -- I had a bug in the previous versions that made the tranches less accurate at the lower sensitivity range.

    -Laura

  • init_jsinit_js Member
    edited June 2018

    I see! We've been passing the output of GenotypeGVCFs directly into VariantRecalibrator. Did not know inputs to VariantRecalibrator had to be trimmed to sites-only. Will try that. Re: Too much data, the experience we've had running VariantRecalibrator over all the genome in one go, is that it stops making progress about halfway through, slowed down to a grinding halt. We considered building the model over just part of the genome instead (say Chr01-to-Chr08) to work around the limitation, but it's arguably better to let it see everything. Our dataset comes from approx 2000 plant samples.

    VariantRecalibrator has a --sites-only-vcf-output argument, but that's on the output side. GenotypeGVCFs also has --sites-only-vcf-output. But that probably means we'd have to call GenotypeGVCFs twice (once with all info, once site only). Thanks. Will opt for full VCFs, and then convert them to sites-only.

    I see the HardFilterAndMakeSitesOnlyVcf step in the pipeline also calls VariantFiltration (with excesshet) before producing the sites-only vcf. How does that affect the model? https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/5585cdf7877104f2c61b2720ddfe7235f2fad577/JointGenotypingWf.wdl#L415

    Will MakeSitesOnlyVcf be compatible with Allele specific annotations (will it keep them?)

    JS

  • init_jsinit_js Member
    edited June 2018

    I was confused by the statement:

    I wouldn't recommend scattering VariantRecalibrator in that way because each call will build a different model and different shards of your data will effectively be filtered using different "rules". That's the reason my large genome callset workflow has one VariantRecalibrator call sampling 10% of the data.

    I looked back at the WDL and it did seem, contrary to the statement, like the call to VariantRecalibrator was scattered (here, https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/5585cdf7877104f2c61b2720ddfe7235f2fad577/JointGenotypingWf.wdl#L164 ).

    I then realized that VariantRecalibrator was used in two modes, and I was conflating the two. That wasn't really clear from the docs. It is first used to build a model when --output-model ${model_report_filename} is given. That call should not be scattered. And then it is used again in a scattered way, this time specifying --input-model {output_report} --output-tranches-for-scatter for each shard.

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    VariantRecalibrator doesn't look at any of the format-field data, so it's just extra memory to read it all in. MakeSitesOnlyVcf "chops off" the format field, leaving all the annotations (including allele-specific ones) intact.

    Sorry about the lack of documentation for how the --output-model and --input-model arguments are used in the workflow. The large sample number branch of the workflow was a recent addition to the production pipeline that we haven't formalized as part of the GATK Best Practices yet.

Sign In or Register to comment.