Apparent bug in VariantRecalibrator

jllavin Member
edited September 2014 in Ask the GATK team

As advised by the GATK program itself, I am posting this potential bug in the forum:

This is the command I used to call the program from my Perl pipeline:

java -Xmx4g -Djava.io.tmpdir=/tmp -jar /opt/gatk/gatk3.2-3/GenomeAnalysisTK.jar 
-T VariantRecalibrator 
-R /storage/Genomes/GATK/hg19.fasta -input "$path"/"$name".recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 
/storage/Genomes/GATK/Mills_and_1000G_gold_standard.indels.hg19.vcf 
-an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL 
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 
-recalFile "$path"/"$name".recalibrate_INDEL.recal 
-tranchesFile "$path"/"$name".recalibrate_INDEL.tranches 
-rscriptFile "$path"/"$name".recalibrate_INDEL_plots.R 

"INFO 13:04:24,773 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:04:24,776 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03
INFO 13:04:24,776 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 13:04:24,777 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 13:04:24,783 HelpFormatter - Program Args: -T VariantRecalibrator -R /storage/Genomes/GATK/hg19.fasta -input /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 /storage/Genomes/GATK/Mills_and_1000G_gold_standard.indels.hg19.vcf -an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrate_INDEL.recal -tranchesFile /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrate_INDEL.tranches -rscriptFile /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrate_INDEL_plots.R
INFO 13:04:24,786 HelpFormatter - Executing as [email protected] on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_21-b11.
INFO 13:04:24,787 HelpFormatter - Date/Time: 2014/09/19 13:04:24
INFO 13:04:24,787 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:04:24,787 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:04:29,468 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:04:30,018 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 13:04:30,244 GenomeAnalysisEngine - Preparing for traversal
INFO 13:04:30,271 GenomeAnalysisEngine - Done preparing for traversal
INFO 13:04:30,271 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:04:30,272 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 13:04:30,272 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 13:04:30,279 TrainingSet - Found mills track: Known = true Training = true Truth = true Prior = Q12.0
INFO 13:05:00,823 ProgressMeter - chr5:94965875 1168956.0 30.0 s 26.0 s 31.1% 96.0 s 66.0 s
INFO 13:05:24,266 VariantDataManager - DP: mean = 16.62 standard deviation = 10.59
INFO 13:05:24,268 VariantDataManager - FS: mean = 1.14 standard deviation = 2.63
INFO 13:05:24,270 VariantDataManager - MQRankSum: mean = -0.16 standard deviation = 1.11
INFO 13:05:24,273 VariantDataManager - ReadPosRankSum: mean = -0.04 standard deviation = 0.99
INFO 13:05:24,300 VariantDataManager - Annotations are now ordered by their information content: [DP, FS, MQRankSum, ReadPosRankSum]
INFO 13:05:24,301 VariantDataManager - Training with 1829 variants after standard deviation thresholding.
INFO 13:05:24,305 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 13:05:24,475 VariantRecalibratorEngine - Finished iteration 0.
INFO 13:05:24,535 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.20003
INFO 13:05:24,564 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 4.20358
INFO 13:05:24,582 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.00341
INFO 13:05:24,593 VariantRecalibratorEngine - Convergence after 18 iterations!
INFO 13:05:24,611 VariantRecalibratorEngine - Evaluating full set of 3202 variants...
INFO 13:05:24,612 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 13:05:26,140 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

"

I hope it can be easily fixed, since I desperately need my pipeline working for my current analysis ;)

Comments

  • tommycarstensen United Kingdom Member ✭✭✭

    @jllavin Have you checked that your input files and output directories exist?

  • jllavin Member

    @tommycarstensen, everything seemed to be all right, but I will re-check those files to make sure everything is correct and report my findings here ;)
    Maybe I posted too swiftly because the error message asked me to do so. Anyway, I'll come back with an answer to the point you raised.
    Thank you very much for your suggestion.

  • pdexheimer Member ✭✭✭✭

    This error is because VariantRecalibrator didn't have any negative training variants - the relevant lines are

    INFO 13:05:24,612 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
    

    and

    java.lang.IllegalArgumentException: No data found. at 
    

    I think, but am not certain, that this error is caused by having too few variants in the input (the log shows only 3202 variants being evaluated). It might also be exacerbated by giving the only set of "truth" variants a pretty low confidence value (prior=12.0), but I'm not sure.
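
The two log lines quoted above can also be picked out automatically. The following is a minimal Python sketch, not part of GATK, that scans a VariantRecalibrator log for the training-set sizes and the "No data found" exception. The function name `summarize_vqsr_log` and the regular expressions are assumptions based only on the log text shown in this thread; other GATK versions may word these lines differently.

```python
import re

# Assumed wording, copied from the log in this thread (GATK 3.2).
# Size of the set the model was trained on after thresholding.
POSITIVE_SET = re.compile(r"Training with (\d+) variants after standard deviation thresholding")
# Size of the negative ("worst scoring") training set.
NEGATIVE_SET = re.compile(r"Training with worst (\d+) scoring variants")


def summarize_vqsr_log(log_text):
    """Return training-set sizes found in a log (None if a line is absent)."""
    summary = {"positive": None, "negative": None, "no_data_error": False}
    for line in log_text.splitlines():
        match = POSITIVE_SET.search(line)
        if match:
            summary["positive"] = int(match.group(1))
        match = NEGATIVE_SET.search(line)
        if match:
            summary["negative"] = int(match.group(1))
        if "IllegalArgumentException: No data found" in line:
            summary["no_data_error"] = True
    return summary


# Lines taken verbatim from the log posted above.
SAMPLE_LOG = """\
INFO 13:05:24,301 VariantDataManager - Training with 1829 variants after standard deviation thresholding.
INFO 13:05:24,612 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
java.lang.IllegalArgumentException: No data found.
"""

if __name__ == "__main__":
    print(summarize_vqsr_log(SAMPLE_LOG))
```

A negative set of size 0 together with the exception is the pattern described in this comment; a pipeline could check for it and fall back to another filtering strategy instead of aborting.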

  • jllavin Member

    @pdexheimer, that seems to be the case...

    In that case, what can I do to fix this problem? The data sets I'm analyzing fail at the indel recalibration step.
    Is there any parameter I can modify to obtain the variants present in my input, or should I look for an alternative to GATK for this particular task? If so, any tool suggestions?
    Thanks in advance.

  • pdexheimer Member ✭✭✭✭

    @jllavin - I think you're confused about the purpose of recalibration. You've already called indels; you're now trying to filter out the false positive calls. If you haven't already, it would be well worth your time to read through the Best Practices, particularly all of the linked articles about VQSR (under Variant Discovery/Variant Filtering in the BP).

    To answer your question more directly, the only solution in cases where you can't run VQSR is to do a hard filtering step.
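
To illustrate what a hard filtering step does, here is a minimal Python sketch that checks the INFO column of a VCF record against fixed thresholds. It is not GATK code: in a GATK pipeline this job is done by the VariantFiltration tool. The thresholds (QD < 2.0, FS > 200.0, ReadPosRankSum < -20.0) are assumptions, taken as generic starting points for indels that should be tuned per data set, and the helper names `parse_info` and `failed_filters` are hypothetical.

```python
# Illustrative thresholds only (assumption): generic starting points for
# indels. Each entry maps an annotation name to a "fails the filter" test.
INDEL_FILTERS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 200.0,
    "ReadPosRankSum": lambda v: v < -20.0,
}


def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flag entries map to True."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def failed_filters(info_field, filters=INDEL_FILTERS):
    """Return the sorted names of annotations whose value fails its threshold.

    Annotations that are missing or non-numeric are skipped, not failed.
    """
    info = parse_info(info_field)
    failed = []
    for name, fails in filters.items():
        raw = info.get(name)
        if raw is None or raw is True:
            continue
        try:
            value = float(raw)
        except ValueError:
            continue
        if fails(value):
            failed.append(name)
    return sorted(failed)


if __name__ == "__main__":
    # A record passes when the list of failed filters is empty.
    print(failed_filters("DP=20;FS=1.14;QD=25.3;ReadPosRankSum=-0.04"))
    print(failed_filters("DP=5;FS=250.0;QD=1.2"))
```

Unlike VQSR, this approach needs no training set, so it still works when the input contains too few variants to build a model; the cost is that each threshold is applied independently rather than being learned from the data.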

  • jllavin Member

    @pdexheimer,
    Thank you very much. I think I have solved the problem by following your advice ;)

    Best wishes

    JL
