
VariantRecalibrator - no data found

tgenahmet Posts: 4 Member

I just updated to the latest nightly and got the same error:

INFO 12:03:16,652 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00258
INFO 12:03:23,474 ProgressMeter - GL000202.1:10465 5.68e+07 32.4 m 34.0 s 98.7% 32.9 m 25.0 s
INFO 12:03:32,263 VariantRecalibratorEngine - Convergence after 46 iterations!
INFO 12:03:41,008 VariantRecalibratorEngine - Evaluating full set of 4944219 variants...
INFO 12:03:41,100 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
    at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-03-20-g65934ae):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------
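
The log above reports `Training with worst 0 scoring variants` right before the exception. One reading (an interpretation, not something confirmed in this thread) is that no variants were available to train the negative model. A minimal shell sketch for pulling that count out of a saved log; the function name `worst_variant_count` and the log file name are hypothetical:

```shell
# Hypothetical helper: print N from the first
# "Training with worst N scoring variants" line in a VariantRecalibrator log.
worst_variant_count() {
    sed -n 's/.*Training with worst \([0-9][0-9]*\) scoring variants.*/\1/p' "$1" | head -n 1
}
```

For example, `worst_variant_count vqsr.log` prints the count, or nothing if the line is absent from the log.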

Answers

  • Geraldine_VdAuwera Posts: 5,973 Administrator, GATK Developer

    What's your command line? Can you post your full output log?

    Geraldine Van der Auwera, PhD

  • tgenahmet Posts: 4 Member

    Attached. It contains the command line as well.

    Attachment: GhanaTNBC_0010_1_SA_Whole_C1_KAWGL_J00022-GhanaTNBC_0010_1_BE_Whole_T2_KAWGL_J00023.UG.vcf.varRecalibrator.txt (18K)
  • Geraldine_VdAuwera Posts: 5,973 Administrator, GATK Developer

    Thanks. I see you're running with --mode BOTH, which is unsupported and goes against our recommendations. This may not be the cause of the issue you encountered, but you'll need to try again in SNP or INDEL mode before I can help you.

    Geraldine Van der Auwera, PhD

  • noushin6 Baltimore, MD Posts: 14 Member

    I am getting an identical error message with an almost identical command line, but using -mode SNP. Is there any way I can get some debugging help from you? I am running GATK version 3.1-1-g07a4bf8.

  • Geraldine_VdAuwera Posts: 5,973 Administrator, GATK Developer

    @noushin6, can you please post your command lines?

    Geraldine Van der Auwera, PhD

  • noushin6 Baltimore, MD Posts: 14 Member
    edited April 1

    Sure. Here is my command line:

    java -Xmx${heap}m -jar ${gatk}\
     -T VariantRecalibrator\
     -R ${refSequence}\
     -input ${SCRATCH}/${sample}.raw_variants.vcf\
     -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${trHAPMAP}\
     -resource:omni,known=false,training=true,truth=true,prior=12.0 ${trOMNI}\
     -resource:1000G,known=false,training=true,truth=false,prior=10.0 ${tr1KG}\
     -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${trDBSNP}\
     -an DP\
     -an QD\
     -an FS\
     -an MQRankSum\
     -an ReadPosRankSum\
     -mode SNP\
     -tranche 100 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
     -recalFile ${SCRATCH}/${sample}.recalibrate_SNP.recal\
     -tranchesFile ${SCRATCH}/${sample}.recalibrate_SNP.tranches\
     -rscriptFile ${SCRATCH}/${sample}.recalibrate_SNP_plots.R
    

    The variables point to the corresponding paths; the command above is an excerpt from a makefile.

    Thank you!

  • Geraldine_VdAuwera Posts: 5,973 Administrator, GATK Developer

    And what is your ApplyRecalibration command line?

    Geraldine Van der Auwera, PhD

  • noushin6 Baltimore, MD Posts: 14 Member
    edited April 1

    Here is my ApplyRecalibration command line:

    java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}_snp_recal\
        -jar ${gatk}\
        -T ApplyRecalibration\
        -R ${refSequence}\
        -input ${SCRATCH}/${sample}.raw_variants.vcf\
        -mode SNP\
        --ts_filter_level 99.0\
        -recalFile ${SCRATCH}/${sample}.recalibrate_SNP.recal\
        -tranchesFile ${SCRATCH}/${sample}.recalibrate_SNP.tranches\
        -o ${SCRATCH}/${sample}.recalibrate_snps_raw_indels.vcf
    

    This is the next step after the VariantRecalibrator call above, which is the one that fails. I am trying to follow the steps in http://www.broadinstitute.org/gatk/guide/topic?name=tutorials.

    Thanks!

  • Geraldine_VdAuwera Posts: 5,973 Administrator, GATK Developer

    Mmkay, not seeing much -- can you post the log file for one run? Are you running recalibration per sample? This is not our recommended workflow...

    Geraldine Van der Auwera, PhD

  • noushin6 Baltimore, MD Posts: 14 Member

    Do you mean the log file for one run of VariantRecalibrator?

    I am possibly confused about the recommended workflow at this stage. Can you please point me to the proper section of documentation?

    I am planning to run HaplotypeCaller on my individual samples to generate the initial set of variant calls. Should I do a bam file merge from multiple samples before calling HaplotypeCaller? The experiment I am looking at has very few normal tissue bam files.

  • Geraldine_VdAuwera Posts: 5,973 Administrator, GATK Developer

    I meant the log files for one run of VariantRecalibrator and the corresponding run of ApplyRecalibration. The point is to make sure that the inputs and outputs are matching up correctly.

    You have two possible workflow options for your experiment. One is to call variants on your samples all together, which produces a multisample VCF that you then put through VQSR, as described in the existing Best Practices document here.

    The second option is a brand new workflow which will replace the one I just described (we're still updating the docs). The idea is that instead of calling variants together on all samples, you do it per-sample, but in a special mode that produces GVCFs. Then you run a new joint genotyping step on the GVCFs, which produces a regular multisample VCF that you then put through VQSR. This allows you to bypass the performance issues associated with multisample calling. See here for more details.

    In any case you should not be running VQSR on individual samples, because that will cause your analysis to be underpowered. But keep in mind that unless you use the new workflow (with GVCFs and the additional joint genotyping step), you also can't run VQSR together on samples that were called separately.

    Let me know if you need any further clarification.

    Geraldine Van der Auwera, PhD
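
The GVCF workflow described above (per-sample calling, then joint genotyping, then VQSR) can be sketched as a dry run that only prints the commands. This is a sketch under assumptions, not the official recipe: the helper name, the jar name, and all file names are hypothetical, and the HaplotypeCaller and GenotypeGVCFs arguments are written as I recall them from the GATK 3.x documentation, so check them against the docs for your version before use.

```shell
# Dry-run sketch: prints GATK 3.x commands, executes nothing.
# Usage: print_gvcf_workflow REFERENCE SAMPLE1.bam [SAMPLE2.bam ...]
# The body runs in a subshell so its variables do not leak into the caller.
print_gvcf_workflow() (
    ref=$1
    shift
    variants=""
    for bam in "$@"; do
        sample=$(basename "$bam" .bam)
        # Step 1: call each sample separately in GVCF mode.
        echo "java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $ref -I $bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o $sample.g.vcf"
        variants="$variants --variant $sample.g.vcf"
    done
    # Step 2: joint genotyping across all GVCFs gives one multisample VCF,
    # which is the input for VariantRecalibrator.
    echo "java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R $ref$variants -o joint.vcf"
)
```

Running `print_gvcf_workflow ref.fasta a.bam b.bam` prints two HaplotypeCaller commands followed by one GenotypeGVCFs command; `joint.vcf` would then go through VariantRecalibrator and ApplyRecalibration.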

  • michael_recombine nyc Posts: 4 Member

    Hi there,

    I am running into an error very similar to tgenahmet's. I am trying to recalibrate some variants that I called from a .bam file generated from two trusightone paired-end reads. Attached are my command line and the output. Any help would be greatly appreciated. Could the error be due to the fact that I am trying to call these variants from only one .bam file?

    Thank you in advance!

    Attachment: Variant_recalibrator_error.txt (6K)
  • Sheila Broad Institute Posts: 354 Member, GATK Developer, Broadie, Moderator

    @michael_recombine

    Hi,

    Is this a single exome you are running on? It's not recommended to run on only one exome sample (WGS may be ok).

    If you only have one exome sample, you can use data from the 1000 Genomes project to beef up your data set. http://www.1000genomes.org/data

    Or, you can try to force the number of bad variants using minNumBadVariants (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html#--minNumBadVariants). But, adding more data is better.

    -Sheila
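
Since the advice above depends on how many samples the callset contains, a quick way to check is to count the sample columns in the VCF header. A minimal sketch, assuming a standard tab-delimited VCF with genotype columns; the function name `count_vcf_samples` is hypothetical:

```shell
# Hypothetical helper: count the samples in a VCF.
# The #CHROM header line has 9 fixed columns (CHROM through FORMAT);
# every column after those is one sample.
count_vcf_samples() {
    awk -F '\t' '/^#CHROM/ { n = NF - 9; if (n < 0) n = 0; print n; exit }' "$1"
}
```

For example, `count_vcf_samples raw_variants.vcf` prints `1` for a single-sample callset.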

  • michael_recombine nyc Posts: 4 Member

    I am only looking at one individual's exome, so that could be the issue. My vcf file contains over 10k variants; what is typically a suitably large number of variants? The main reason your suggestion puzzles me is that in the past, when testing this pipeline, I was getting an error prompting me to use -minNumBad to fix it. However, now I am not getting that error and am instead just seeing: ##### ERROR MESSAGE: No data found.

    Any suggestions would be great. I am retrying with -minNumBad.

    Thanks again!

  • michael_recombine nyc Posts: 4 Member

    Just ran the data with -minNumBad set to 5000 and got the same error. Attached are my command line and output. Would you suggest just merging in some more samples to increase my amount of data? Thanks!

    Attachment: VQSR_error_2.txt (8K)
  • Sheila Broad Institute Posts: 354 Member, GATK Developer, Broadie, Moderator

    @michael_recombine

    Hi,

    Yes, the best thing to do is to use data from the 1000 Genomes project. Please find the data here: http://www.1000genomes.org/data

    -Sheila

  • michael_recombine nyc Posts: 4 Member

    Hi @Sheila,

    Thanks for your help up to now. So I merged my vcf with the 1000 Genomes vcf and then ran VQSR for SNPs, and it was successful (FINALLY)! But when I tried to run the output through VQSR again for indels, I received the same error I got before. This time around it found very few indels, obviously. Is this because I chose 1000G as my additional data?

    Thanks again!

    Attachment: Indel_ERROR.txt (5K)
  • Sheila Broad Institute Posts: 354 Member, GATK Developer, Broadie, Moderator

    @michael_recombine

    Hi,

    Two things:

    1) You should not simply merge your vcfs with the 1000G vcfs. You should get the 1000G bams, run the calling pipeline to generate GVCFs, do joint genotyping on all GVCFs together, and only then do VQSR. I realize this was not apparent in my original post, so I will be preparing a new article explaining this more clearly.

    2) Issues with indels are common because indels are so much rarer than SNPs. This is not caused by choosing 1000G. I do not know how much data you used, but you might need to use more data from 1000G. Our recommendation is to use 30 or more bams.

    Good luck!

    -Sheila
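
To see how few indels a callset contains before running VQSR in INDEL mode, the records can be tallied straight from the VCF. A rough sketch, assuming a tab-delimited VCF; the function name `count_snps_indels` is hypothetical, and the classification is deliberately crude: a record counts as an indel if any ALT allele differs in length from REF, so MNPs are counted with the SNPs and symbolic alleles are not handled specially.

```shell
# Hypothetical helper: rough SNP/indel tally for a VCF.
# A record is counted as an indel when any ALT allele (column 5)
# has a different length than REF (column 4); otherwise as a SNP.
count_snps_indels() {
    awk -F '\t' '
        /^#/ { next }
        {
            n = split($5, alts, ",")
            is_indel = 0
            for (i = 1; i <= n; i++)
                if (length(alts[i]) != length($4)) is_indel = 1
            if (is_indel) indels++; else snps++
        }
        END { printf "snps=%d indels=%d\n", snps, indels }
    ' "$1"
}
```

For example, `count_snps_indels joint.vcf` prints a single line of the form `snps=N indels=M`.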
