The current GATK version is 3.2-2

# VariantRecalibrator - no data found

Posts: 4Member

I just updated to the latest nightly and got the same error:

INFO 12:03:16,652 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00258
INFO 12:03:23,474 ProgressMeter - GL000202.1:10465 5.68e+07 32.4 m 34.0 s 98.7% 32.9 m 25.0 s
INFO 12:03:32,263 VariantRecalibratorEngine - Convergence after 46 iterations!
INFO 12:03:41,008 VariantRecalibratorEngine - Evaluating full set of 4944219 variants...
INFO 12:03:41,100 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-03-20-g65934ae):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: No data found.
##### ERROR ------------------------------------------------------------------------------------------

## Answers

• Posts: 6,213Administrator, GATK Developer admin

What's your command line? Can you post your full output log?

Geraldine Van der Auwera, PhD

• Posts: 4Member

Attached. It contains the command line as well.

• Posts: 6,213Administrator, GATK Developer admin

Thanks. I see you're running with --mode BOTH, which is unsupported and goes against our recommendations. This may not be the cause of the issue you encountered, but you'll need to try again in SNP or INDEL mode before I can help you.

Geraldine Van der Auwera, PhD

• Baltimore, MDPosts: 14Member

I am getting an identical error message with almost identical command line usage and using --mode SNP. Is there any way I can get some debugging help from you? I am running gatk version 3.1-1-g07a4bf8.

• Posts: 6,213Administrator, GATK Developer admin

@noushin6, can you please post your command lines?

Geraldine Van der Auwera, PhD

• Baltimore, MDPosts: 14Member

Sure. Here is my commandline:

java -Xmx${heap}m -jar ${gatk} \
-T VariantRecalibrator \
-R ${refSequence} \
-input ${SCRATCH}/${sample}.raw_variants.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${trHAPMAP} \
-resource:omni,known=false,training=true,truth=true,prior=12.0 ${trOMNI} \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 ${tr1KG} \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${trDBSNP} \
-an DP \
-an QD \
-an FS \
-an MQRankSum \
-mode SNP \
-tranche 100 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile ${SCRATCH}/${sample}.recalibrate_SNP.recal \
-tranchesFile ${SCRATCH}/${sample}.recalibrate_SNP.tranches \
-rscriptFile ${SCRATCH}/${sample}.recalibrate_SNP_plots.R


The variables point to corresponding paths, as the line above is a segment from a makefile.

Thank you!

And what is your ApplyRecalibration command line?

Geraldine Van der Auwera, PhD

• Baltimore, MDPosts: 14Member

Here is my ApplyRecalibration command line:

java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}_snp_recal \
-jar ${gatk} \
-T ApplyRecalibration \
-R ${refSequence} \
-input ${SCRATCH}/${sample}.raw_variants.vcf \
-mode SNP \
--ts_filter_level 99.0 \
-recalFile ${SCRATCH}/${sample}.recalibrate_SNP.recal \
-tranchesFile ${SCRATCH}/${sample}.recalibrate_SNP.tranches \
-o ${SCRATCH}/${sample}.recalibrate_snps_raw_indels.vcf


This is the step that follows the VariantRecalibrator call above, which is the one that fails. I am trying to follow the steps in http://www.broadinstitute.org/gatk/guide/topic?name=tutorials.

Thanks!

Mmkay, not seeing much -- can you post the log file for one run? Are you running recalibration per sample? This is not our recommended workflow...

Geraldine Van der Auwera, PhD

• Baltimore, MDPosts: 14Member

Do you mean the log file for one run of VariantRecalibrator?

I am possibly confused about the recommended workflow at this stage. Can you please point me to the proper section of documentation?

I am planning to run HaplotypeCaller on my individual samples to generate the initial set of variant calls. Should I do a bam file merge from multiple samples before calling HaplotypeCaller? The experiment I am looking at has very few normal tissue bam files.

I meant the log files for one run of VariantRecalibrator and the corresponding run of ApplyRecalibration. The point is to make sure that the inputs and outputs are matching up correctly.
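One way to check that the two runs line up is to confirm that the files VariantRecalibrator was asked to write actually exist and are non-empty before passing them to ApplyRecalibration. The following is a minimal sketch, not part of GATK: `check_vqsr_outputs` is a hypothetical helper name, and the paths in the usage comment are the ones from the command lines posted above.

```shell
# Hypothetical helper (not a GATK tool): report any expected file that is
# missing or empty. Returns 0 only if every file exists with non-zero size.
check_vqsr_outputs() {
  rc=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example use between the two steps (paths taken from the posted commands):
# check_vqsr_outputs "${SCRATCH}/${sample}.recalibrate_SNP.recal" \
#                    "${SCRATCH}/${sample}.recalibrate_SNP.tranches" \
#   && echo "recal and tranches files are present"
```

If VariantRecalibrator stops with "No data found", these files are not expected to be usable, so a check like this would simply fail early instead of letting ApplyRecalibration run on absent inputs.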

You have two possible workflow options for your experiment. One is to call variants on your samples all together, which produces a multisample VCF that you then put through VQSR, as described in the existing Best Practices document here.

The second option is a brand new workflow which will replace the one I just described (we're still updating the docs). The idea is that instead of calling variants together on all samples, you do it per-sample, but in a special mode that produces GVCFs. Then you run a new joint genotyping step on the GVCFs, which produces a regular multisample VCF that you then put through VQSR. This allows you to bypass the performance issues associated with multisample calling. See here for more details.

In any case you should not be running VQSR on individual samples, because that will cause your analysis to be underpowered. But keep in mind that unless you use the new workflow (with GVCFs and the additional joint genotyping step), you also can't run VQSR together on samples that were called separately.
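The second workflow can be sketched as a short dry-run script. This is only an illustration, assuming GATK 3.x command-line syntax: the jar path, reference name, and sample file names are hypothetical placeholders, and the `--variant_index_type` / `--variant_index_parameter` pair is included on the assumption that the GATK release in use still needs it for GVCF output. The script only prints the commands unless `DRY_RUN` is changed to 0.

```shell
# Sketch of the per-sample GVCF workflow (GATK 3.x syntax assumed).
# All paths below are hypothetical placeholders.
GATK_JAR="GenomeAnalysisTK.jar"
REF="reference.fasta"
DRY_RUN=1   # set to 0 to actually execute the commands

run() {
  # Print the command; execute it only when DRY_RUN is 0.
  echo "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Step 1: call variants per sample in the mode that produces a GVCF.
call_gvcf() {
  bam="$1"
  out="${bam%.bam}.g.vcf"
  run java -jar "$GATK_JAR" -T HaplotypeCaller -R "$REF" -I "$bam" \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o "$out"
}

# Step 2: joint genotyping on all GVCFs, producing one multisample VCF.
joint_genotype() {
  out="$1"; shift
  # Rewrite the argument list so each GVCF is preceded by --variant.
  for g in "$@"; do set -- "$@" --variant "$g"; shift; done
  run java -jar "$GATK_JAR" -T GenotypeGVCFs -R "$REF" "$@" -o "$out"
}

call_gvcf sampleA.bam
call_gvcf sampleB.bam
joint_genotype cohort.vcf sampleA.g.vcf sampleB.g.vcf
# Step 3 (not shown): run VariantRecalibrator and ApplyRecalibration on cohort.vcf.
```

Keeping the dry-run switch makes it easy to inspect the exact command lines before spending compute time on them.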

Let me know if you need any further clarification.

Geraldine Van der Auwera, PhD

• nycPosts: 4Member

Hi there,

I am running into a very similar error to tgenahmet. I am trying to recalibrate some variants that I called from a .bam file produced from two TruSight One paired-end reads. Attached are my command line and the output. Any help would be greatly appreciated. Could the error be due to the fact that I am trying to call these variants from only one .bam file?

Hi,

Is this a single exome you are running on? It's not recommended to run on only one exome sample (WGS may be ok).

If you only have one exome sample, you can use data from the 1000 Genomes project to beef up your data set. http://www.1000genomes.org/data

-Sheila

• nycPosts: 4Member

I am only looking at one individual's exome, so that could be the issue. My VCF file contains over 10k variants; what is typically a suitably large number of variants? The main reason why your suggestion is puzzling to me is that in the past, when testing this pipeline, I was getting an error prompting me to use -minNumBad to fix my error. However, now I am not getting that error and am instead just seeing: ##### ERROR MESSAGE: No data found.

Any suggestions would be great. I am retrying with -minNumBad

Thanks again!

• nycPosts: 4Member

Just ran the data with -minNumBad set to 5000 and got the same error. Attached is my command line and output. Would you suggest just merging in some more samples to increase my amount of data? Thanks!

Hi,

Yes, the best thing to do is to use data from the 1000 Genomes project. Please find the data here: http://www.1000genomes.org/data

-Sheila

• nycPosts: 4Member

Hi @Sheila,

Thanks for your help up to now. So I merged my VCF with the 1000 Genomes VCF and then ran VQSR for SNPs, and it was successful (FINALLY)! But when I tried to run the output through VQSR again for indels, I received the same error I got before. This time around it obviously found very few indels. Is this because I chose 1000G as my additional data?

Thanks again!

Hi,

Two things:

1) You should not simply merge your VCFs with the 1000G VCFs. You should get the 1000G bams, run the calling pipeline to generate GVCFs, do joint genotyping on all GVCFs together, and only then run VQSR. I realize this was not apparent in my original post, so I will be preparing a new article explaining this more clearly.

2) Problems with indels are common because indels are so much rarer than SNPs. This is not caused by choosing 1000G. I do not know how much data you used, but you might need to use more data from 1000G. Our recommendation is to use 30 or more bams.
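To get a rough idea of how many SNP and indel records a call set contains before running VQSR in each mode, the records can be tallied with awk. This is a minimal sketch with a deliberately simple heuristic: `count_variant_types` is a hypothetical helper name, it expects an uncompressed VCF, and it treats any record where an ALT allele differs in length from REF as an indel. Symbolic alleles and multi-nucleotide substitutions are not handled specially, so the numbers are only indicative.

```shell
# Hypothetical helper: rough SNP/indel tally for an uncompressed VCF.
# Heuristic only: a record counts as an indel if any ALT allele has a
# different length than REF; everything else counts as a SNP.
count_variant_types() {
  awk -F'\t' '
    /^#/ { next }                       # skip header lines
    {
      n = split($5, alts, ",")          # ALT may list several alleles
      is_indel = 0
      for (i = 1; i <= n; i++)
        if (length(alts[i]) != length($4)) is_indel = 1
      if (is_indel) indels++; else snps++
    }
    END { printf "snps=%d indels=%d\n", snps, indels }
  ' "$1"
}

# Example use (file name is a placeholder):
# count_variant_types cohort.vcf
```

A very small indel count relative to the SNP count would be consistent with the indel-mode run finding too little data to train on.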

Good luck!

-Sheila

• kamilo889Posts: 6Member

Hi all, I am trying to run VQSR, but I get the "No data found" error in VariantRecalibrator. From reading the comments, this seems to be because I'm running an exome analysis on an individual sample, so there is not enough data to run the program, right? But the part about "merging the file with the 1000G" isn't clear to me. Can you help me a little more, please?

Regards Camilo

Hi Camilo,