Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Variant Recalibrator Error

Hello everyone,

I am using the VariantRecalibrator tool in the following command:

"
export Gold="/ibex/scratch/alsaedsb/Task_w1/ref/Mills_and_1000G_gold_standard.indels.b37.vcf"
export dbsnp="/ibex/scratch/alsaedsb/Task_w1/ref/dbsnp_138.b37.vcf"

gatk VariantRecalibrator -V $VCF/${PREFIX}_SNP_INDLS_only_VCF.vcf -O $RECAL/${PREFIX}_INDEL_HG.recal --tranches-file $RECAL/${PREFIX}_INDEL_HG.tranche --trust-all-polymorphic -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode INDEL --max-gaussians 4 --resource:mills,known=false,training=true,truth=true,prior=12 ${Gold} --resource:dbsnp,known=false,training=true,truth=true,prior=2.0 ${dbsnp}"

and have submitted the script on slurm and I got the following error:

"A USER ERROR has occurred: The argument: "resource/resource" does not accept tags: "Mills,known=false,training=true,truth=true,prior=12""


I have tried many ways to solve that , but the error is there, May any one help me to solve it ?
Tagged:

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @Sakhaa

    Which version of GATK are you using?

  • SakhaaSakhaa Member
  • SakhaaSakhaa Member
    Hi @bhanuGandham
    I am using : gatk 4.1.2.0
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Sakhaa

    Can you please post the entire error log.

  • SakhaaSakhaa Member
    edited August 13
    sure @bhanuGandham

    "[VCF]$ gatk VariantRecalibrator -V S_18_2123_NIST-8398_AD008_L008_SNP_INDLS_only_VCF.vcf -O $RECAL/INDEL_HG.recal --tranches-file $RECAL/INDEL_HG.tranche --trust-all-polymorphic -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode INDEL --max-gaussians 4 --resource:Mills,known=false,training=true,truth=true,prior=12 /ibex/scratch/alsaedsb/Task_w1/ref/Mills_and_1000G_gold_standard.indels.b37.vcf --resource:dbsnp,known=false,training=true,truth=true,prior=2.0 ${dbsnp}
    Using GATK jar /sw/csi/gatk/4.0.1.1/el7_binary/bin/gatk-package-4.0.1.1-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -jar /sw/csi/gatk/4.0.1.1/el7_binary/bin/gatk-package-4.0.1.1-local.jar VariantRecalibrator -V S_18_2123_NIST-8398_AD008_L008_SNP_INDLS_only_VCF.vcf -O /ibex/scratch/alsaedsb/BEST_WF/RECAL/INDEL_HG.recal --tranches-file /ibex/scratch/alsaedsb/BEST_WF/RECAL/INDEL_HG.tranche --trust-all-polymorphic -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode INDEL --max-gaussians 4 --resource:Mills,known=false,training=true,truth=true,prior=12 /ibex/scratch/alsaedsb/Task_w1/ref/Mills_and_1000G_gold_standard.indels.b37.vcf --resource:dbsnp,known=false,training=true,truth=true,prior=2.0 /ibex/scratch/alsaedsb/Task_w1/ref/dbsnp_138.b37.vcf
    USAGE: VariantRecalibrator [arguments]

    Build a recalibration model to score variant quality for filtering purposes
    Version:4.0.1.1"
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
  • SakhaaSakhaa Member
    Hi @bhanuGandham

    I have changed the GATK version to v4.1.2.0


    and my command is :

    " gatk VariantRecalibrator
    -V SNP_INDLS_only_VCF_T.vcf
    -O Test_snps_recal_data.recal
    --rscript-file Test_recalibrate_SNP_plots.R
    --tranches-file VR_Test_snps.tranches
    --trust-all-polymorphic
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0
    --max-gaussians 6
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${hapmap}
    --resource:omni,known=false,training=true,truth=true,prior=12.0 ${Omi2}
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 ${G1000}
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${dbSNP}"


    and I got the new error and this is my log

    "17:29:38.736 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/sw/csi/gatk/4.1.2.0/el7.5_binary_Py2.7env/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Aug 30, 2019 5:29:40 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    17:29:40.762 INFO VariantRecalibrator - ------------------------------------------------------------
    17:29:40.763 INFO VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.2.0
    17:29:40.764 INFO VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
    17:29:40.764 INFO VariantRecalibrator - Executing as [email protected] on Linux v3.10.0-957.12.1.el7.x86_64 amd64
    17:29:40.764 INFO VariantRecalibrator - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_162-b12
    17:29:40.765 INFO VariantRecalibrator - Start Date/Time: August 30, 2019 5:29:38 PM AST
    17:29:40.765 INFO VariantRecalibrator - ------------------------------------------------------------
    17:29:40.765 INFO VariantRecalibrator - ------------------------------------------------------------
    17:29:40.766 INFO VariantRecalibrator - HTSJDK Version: 2.19.0
    17:29:40.766 INFO VariantRecalibrator - Picard Version: 2.19.0
    17:29:40.767 INFO VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    17:29:40.767 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    17:29:40.767 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    17:29:40.767 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    17:29:40.768 INFO VariantRecalibrator - Deflater: IntelDeflater
    17:29:40.768 INFO VariantRecalibrator - Inflater: IntelInflater
    17:29:40.768 INFO VariantRecalibrator - GCS max retries/reopens: 20
    17:29:40.768 INFO VariantRecalibrator - Requester pays: disabled
    17:29:40.769 INFO VariantRecalibrator - Initializing engine
    17:29:40.996 INFO FeatureManager - Using codec VCFCodec to read file file:///ibex/scratch/projects/c2014/genomics/reference/hg37/hapmap_3.3.b37.vcf
    17:29:41.053 INFO FeatureManager - Using codec VCFCodec to read file file:///ibex/scratch/projects/c2014/genomics/reference/hg37/1000G_omni2.5.b37.vcf
    17:29:41.074 INFO FeatureManager - Using codec VCFCodec to read file file:///ibex/scratch/projects/c2014/genomics/reference/hg37/1000G_phase1.snps.high_confidence.b37.vcf
    17:29:41.138 INFO FeatureManager - Using codec VCFCodec to read file file:///ibex/scratch/projects/c2014/genomics/reference/hg37/dbsnp_138.b37.vcf
    17:29:41.206 INFO FeatureManager - Using codec VCFCodec to read file file:///ibex/scratch/projects/c2029/PUBLIC_NIST/VCF/RMNISTHS_30xdownsample_SNP_INDLS_only_VCF_T.vcf
    17:29:45.164 WARN IndexUtils - Feature file "/ibex/scratch/projects/c2014/genomics/reference/hg37/1000G_phase1.snps.high_confidence.b37.vcf" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file
    17:29:45.212 INFO VariantRecalibrator - Done initializing engine
    17:29:45.216 INFO TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0
    17:29:45.216 INFO TrainingSet - Found omni track: Known = false Training = true Truth = true Prior = Q12.0
    17:29:45.216 INFO TrainingSet - Found 1000G track: Known = false Training = true Truth = false Prior = Q10.0
    17:29:45.216 INFO TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0
    17:29:45.225 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "recal". Defaulting to VCF."



    And this is the error:

    "A USER ERROR has occurred: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations"


    May I know where is the issue? and how I fix it ?
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Sakhaa

    1) VariantRecalibrator requires at least 30 exome samples or 1WGS to create the training model. How many samples are you providing it? Take a look at this doc: https://software.broadinstitute.org/gatk/documentation/article?id=11097
    2) If you have enough samples, then the other possible cause could be that the program is not finding enough sites that overlap between the resource and your callset can you check what number of sites they have in common?

  • SakhaaSakhaa Member
    edited August 30
    Hi @bhanuGandham,

    1) I am working on 1 sample of WGS.
    2) How can I check that?
    I am using the following resource:
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${hapmap}
    --resource:omni,known=false,training=true,truth=true,prior=12.0 ${Omi2}
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 ${G1000}
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${dbSNP}

    but how can I know the overlap between the resource and your callset?

    Do you mean from BaseRecalibrator step?

    I have one common resource:
    dbSNP=/genomics/reference/hg37/dbsnp_138.b37.vcf
  • SakhaaSakhaa Member
    This is the command in BR step
    BaseRecalibrator -I sorted.bam -R $REF --known-sites $dbSNP --known-sites $Gold --known-sites $G1000 -O $RECAL/$PREFIX.recal_data.table
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Sakhaa

    I mean how many variants are common between your input vcf file and the resource files provided by you here:
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${hapmap}
    --resource:omni,known=false,training=true,truth=true,prior=12.0 ${Omi2}
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 ${G1000}
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${dbSNP}

    There are various bioinformatics tools out there to find overlaps between vcf files, one such tool is bedtools intersect

  • SakhaaSakhaa Member

    Hi @bhanuGandham

    yes there are overlaps between 3 of them and call set, and May I ask how overlaps effect on results and dose the annotation effect also, is haplotype caller step I did not use any annotation I used the following command:
    gatk HaplotypeCaller -R $REF -I $BAM/$PREFIX.recal_data.bam --genotyping-mode DISCOVERY -O $VCF/$PREFIX.snps.indels.vcf

  • SakhaaSakhaa Member
    edited September 10

    Hi @bhanuGandham

    yes there are overlaps between 3 of them and call set, and May I ask how overlaps effect on results and dose the annotation effect also, is haplotype caller step I did not use any annotation I used the following command:
    gatk HaplotypeCaller -R $REF -I $BAM/$PREFIX.recal_data.bam --genotyping-mode DISCOVERY -O $VCF/$PREFIX.snps.indels.vcf

    Then,
    gatk GenotypeGVCFs -R $REF -V $VCF/${PREFIX}_snps.indels.g.vcf -O $VCF/${PREFIX}_SNP_INDLS.vcf -D ${dbsnp} -G StandardAnnotation --use-new-qual-calculator

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited September 12

    HI @Sakhaa

    The way that VariantRecalibrator work is that it creates a machine learning model based on various variant annotations to differentiate between good and bad variants. It does that by focusing on a subset of validated variant sites from the resource files(such as omni, 1000 Genomes, hapmap etc.). VQSR needs sufficient overlapping variants between resources and input vcf to create the ML model. This is why we require users use at least 30 exome or 1WGS. But sometimes even 1WGS is not enough if the overlap is small. Take a look at this doc for more details: https://software.broadinstitute.org/gatk/documentation/article?id=11084
    Looking at your error it seems like there isn't enough overlap between the resource files and the input vcf to create the model.
    Take a look at our single sample variant filtering tool: CNNScoreVariants.
    Here is a blog about this tool: https://software.broadinstitute.org/gatk/blog?id=23457
    This is a better approach in your case.

  • SakhaaSakhaa Member

    Thank you very Much @bhanuGandham ,

    I have used the CNN tool and also I got an error, so I will ask the team in a new tag to be more organized :smile:

Sign In or Register to comment.