# VariantRecalibrator - no data found

Member Posts: 4

I just updated to the latest nightly and got the same error:

INFO 12:03:16,652 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00258
INFO 12:03:23,474 ProgressMeter - GL000202.1:10465 5.68e+07 32.4 m 34.0 s 98.7% 32.9 m 25.0 s
INFO 12:03:32,263 VariantRecalibratorEngine - Convergence after 46 iterations!
INFO 12:03:41,008 VariantRecalibratorEngine - Evaluating full set of 4944219 variants...
INFO 12:03:41,100 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

##### ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-03-20-g65934ae):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: No data found.
##### ERROR ------------------------------------------------------------------------------------------

## Answers

• Administrator, Dev Posts: 11,118 admin

What's your command line? Can you post your full output log?

Geraldine Van der Auwera, PhD

• Member Posts: 4

Attached. It contains the command line as well.

• Administrator, Dev Posts: 11,118 admin

Thanks. I see you're running with --mode BOTH, which is unsupported and goes against our recommendations. This may not be the cause of the issue you encountered, but you'll need to try again in SNP or INDEL mode before I can help you.

Geraldine Van der Auwera, PhD

• Baltimore, MD Member Posts: 17

I am getting an identical error message with almost identical command line usage and using --mode SNP. Is there any way I can get some debugging help from you? I am running gatk version 3.1-1-g07a4bf8.

• Administrator, Dev Posts: 11,118 admin

@noushin6, can you please post your command lines?

Geraldine Van der Auwera, PhD

• Baltimore, MD Member Posts: 17
edited April 2014

Sure. Here is my command line:

java -Xmx${heap}m -jar ${gatk} \
-T VariantRecalibrator \
-R ${refSequence} \
-input ${SCRATCH}/${sample}.raw_variants.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${trHAPMAP} \
-resource:omni,known=false,training=true,truth=true,prior=12.0 ${trOMNI} \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 ${tr1KG} \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${trDBSNP} \
-an DP \
-an QD \
-an FS \
-an MQRankSum \
-mode SNP \
-tranche 100 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile ${SCRATCH}/${sample}.recalibrate_SNP.recal \
-tranchesFile ${SCRATCH}/${sample}.recalibrate_SNP.tranches \
-rscriptFile ${SCRATCH}/${sample}.recalibrate_SNP_plots.R


The variables point to corresponding paths, as the line above is a segment from a makefile.

And what is your ApplyRecalibration command line?

• Baltimore, MD Member Posts: 17
edited April 2014

Here is my ApplyRecalibration command line:

java -Xmx${heap}m -Djava.io.tmpdir=${temp_folder}_snp_recal \
-jar ${gatk} \
-T ApplyRecalibration \
-R ${refSequence} \
-input ${SCRATCH}/${sample}.raw_variants.vcf \
-mode SNP \
--ts_filter_level 99.0 \
-recalFile ${SCRATCH}/${sample}.recalibrate_SNP.recal \
-tranchesFile ${SCRATCH}/${sample}.recalibrate_SNP.tranches \
-o ${SCRATCH}/${sample}.recalibrate_snps_raw_indels.vcf


This is my next step after VariantRecalibrator call above that fails. I am trying to follow the steps in http://www.broadinstitute.org/gatk/guide/topic?name=tutorials.

Mmkay, not seeing much -- can you post the log file for one run? Are you running recalibration per sample? This is not our recommended workflow...

• Baltimore, MD Member Posts: 17

Do you mean the log file for one run of VariantRecalibrator?

I am possibly confused about the recommended workflow at this stage. Can you please point me to the proper section of documentation?

I am planning to run HaplotypeCaller on my individual samples to generate the initial set of variant calls. Should I do a bam file merge from multiple samples before calling HaplotypeCaller? The experiment I am looking at has very few normal tissue bam files.

I meant the log files for one run of VariantRecalibrator and the corresponding run of ApplyRecalibration. The point is to make sure that the inputs and outputs are matching up correctly.

You have two possible workflow options for your experiment. One is to call variants on your samples all together, which produces a multisample VCF that you then put through VQSR, as described in the existing Best Practices document here.

The second option is a brand new workflow which will replace the one I just described (we're still updating the docs). The idea is that instead of calling variants together on all samples, you do it per-sample, but in a special mode that produces GVCFs. Then you run a new joint genotyping step on the GVCFs, which produces a regular multisample VCF that you then put through VQSR. This allows you to bypass the performance issues associated with multisample calling. See here for more details.

In any case you should not be running VQSR on individual samples, because that will cause your analysis to be underpowered. But keep in mind that unless you use the new workflow (with GVCFs and the additional joint genotyping step), you also can't run VQSR together on samples that were called separately.

Let me know if you need any further clarification.
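
The second workflow described above could be sketched roughly as follows. This is a minimal dry-run sketch that assumes GATK 3.x command-line syntax; the jar path, reference path, sample names, and the `run`, `call_gvcf` and `joint_genotype` helpers are placeholders introduced for illustration and are not part of GATK. Some 3.x releases also needed extra index arguments when writing GVCFs, so check the documentation for the version you run.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the per-sample GVCF workflow (GATK 3.x style arguments).
# All paths and sample names below are placeholders (assumptions).
set -eu

GATK_JAR="GenomeAnalysisTK.jar"   # assumed location of the GATK jar
REF="reference.fasta"             # assumed reference FASTA
DRY_RUN=1                         # 1 = only print commands, 0 = also run them

run() {
  # Print the command line; execute it only when DRY_RUN=0.
  echo "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

call_gvcf() {
  # Step 1: call variants on one sample in GVCF mode.
  local sample="$1"
  run java -jar "$GATK_JAR" -T HaplotypeCaller -R "$REF" \
    -I "${sample}.bam" --emitRefConfidence GVCF -o "${sample}.g.vcf"
}

joint_genotype() {
  # Step 2: joint genotyping over all per-sample GVCFs.
  # Usage: joint_genotype <output.vcf> <sample>...
  local out="$1"; shift
  local sample variants=()
  for sample in "$@"; do variants+=(--variant "${sample}.g.vcf"); done
  run java -jar "$GATK_JAR" -T GenotypeGVCFs -R "$REF" "${variants[@]}" -o "$out"
}

samples=(sampleA sampleB sampleC)   # hypothetical sample names
for s in "${samples[@]}"; do call_gvcf "$s"; done
joint_genotype cohort.vcf "${samples[@]}"
# Step 3: run VariantRecalibrator and ApplyRecalibration on cohort.vcf.
```

With `DRY_RUN=1` the script only prints the command lines, which makes it easy to check that each per-sample GVCF feeds into the single joint genotyping call before anything is executed.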

• nyc Member Posts: 4

Hi there,

I am running into a very similar error to tgenahmet. I am trying to recalibrate variants called from a single .bam file built from two sets of TruSight One paired-end reads. Attached are my command line and the output. Any help would be greatly appreciated. Could the error be due to the fact that I am trying to call these variants from only one .bam file?

Thank you in advance!

@michael_recombine

Hi,

Is this a single exome you are running on? It's not recommended to run on only one exome sample (WGS may be ok).

If you only have one exome sample, you can use data from the 1000 Genomes project to beef up your data set. http://www.1000genomes.org/data

Or, you can try to force the number of bad variants using minNumBadVariants (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html#--minNumBadVariants). But, adding more data is better.

• nyc Member Posts: 4

I am only looking at one individual's exome, so that could be the issue. My VCF file contains over 10k variants; what is typically a suitably large number of variants? The main reason your suggestion puzzles me is that, when testing this pipeline in the past, I got an error prompting me to use -minNumBad. Now I no longer get that error and instead just see: ##### ERROR MESSAGE: No data found.

Any suggestions would be great. I am retrying with -minNumBad

Thanks again!

• nyc Member Posts: 4

Just ran the data with -minNumBad set to 5000 and got the same error. Attached is my command line and output. Would you suggest just merging in some more samples to increase my amount of data? Thanks!

@michael_recombine

Hi,

Yes, the best thing to do is to use data from the 1000 Genomes project. Please find the data here: http://www.1000genomes.org/data

• nyc Member Posts: 4

Hi @Sheila,

Thanks for your help up to now. I merged my VCF with the 1000 Genomes VCF and then ran VQSR for SNPs, and it was successful (FINALLY)! But when I tried to run the output through VQSR again for indels, I received the same error as before. This time around it found very few indels, obviously. Is this because I chose 1000G as my additional data?

Thanks again!

@michael_recombine

Hi,

Two things:

1) You should not simply merge your VCFs with the 1000G VCFs. You should get the 1000G bams, run the calling pipeline to generate GVCFs, do joint genotyping on all GVCFs together, and only then run VQSR. I realize this was not apparent in my original post, so I will be preparing a new article explaining this more clearly.

2) Issues with indels are common because indels are so much rarer than SNPs, which leaves the model with far less data. This is not caused by choosing 1000G. I do not know how much data you used, but you might need to use more data from 1000G. Our recommendation is to use 30 or more bams.

Good luck!
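
To see how few indels a call set actually contains before running VQSR in INDEL mode, a quick count can help. The following is a rough sketch that assumes an uncompressed VCF; the `count_variant_types` helper and the file name in the usage comment are hypothetical. It counts a record as a SNP only when REF and every ALT allele are single bases, and lumps everything else (indels, MNPs, mixed sites) together.

```shell
#!/usr/bin/env bash
# Rough SNP vs. non-SNP record count for an uncompressed VCF file.

count_variant_types() {
  # Usage: count_variant_types <file.vcf>
  # Prints "snps=<n> other=<m>", where "other" covers indels, MNPs and mixed sites.
  awk -F'\t' '
    /^#/ { next }                         # skip header lines
    {
      n = split($5, alts, ",")            # column 5 holds the ALT alleles
      is_snp = (length($4) == 1)          # column 4 holds the REF allele
      for (i = 1; i <= n; i++)
        if (length(alts[i]) != 1 || alts[i] == "." || alts[i] == "*")
          is_snp = 0
      if (is_snp) snps++; else other++
    }
    END { printf "snps=%d other=%d\n", snps, other }
  ' "$1"
}

# Example (hypothetical file name):
#   count_variant_types cohort.vcf
```

A very small "other" count relative to the SNP count is consistent with the explanation above that the indel model simply has too little data to train on.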

• kamilo889 Member Posts: 6

Hi all, I am trying to run VQSR but get the "No data found" error in VariantRecalibrator. From the comments, it seems this is because I'm analyzing an exome from a single sample, so there is not enough data to run the program, right? But the part about "merging the file with the 1000G" isn't clear to me. Can you help me a little more, please?

Regards
Camilo

@kamilo889

Hi Camilo,

You are correct that 1 exome is not enough data for VQSR. Instead of VQSR, you can try hard filtering. Please read about hard filtering here: http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set

If you want to use VQSR, you will first need to get more data from the 1000Genomes data webpage here: http://www.1000genomes.org/data

You should get at least 30 bam files from samples that are genetically similar to your sample exome and run all of the steps involved in doing a joint analysis. Please read about how to do a joint analysis here: http://www.broadinstitute.org/gatk/guide/article?id=3893

I hope this helps.

• Yale University Member Posts: 4

Hi, I'm actually running into the same issue with SNP VQSR run on a multisample VCF file produced by GenotypeGVCFs. This is a set of 21 exomes (I know about the k~30 recommendation, but 1000 Genomes does not have ethnically similar samples, and I'm mostly doing this to pilot the workflow).

Here's the code

java -jar /PATH/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R /PATH/hg19index2.fa \
-input XXX.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /PATH/GATK_Resources/hapmap_3.3.hg19.sites.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 /PATH/GATK_Resources/1000G_omni2.5.hg19.sites.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 /PATH/GATK_Resources/1000G_phase1.snps.high_confidence.hg19.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /PATH/GATK_Resources/dbsnp_138.hg19.vcf \
-L /PATH/SeqCap_EZ_Exome_v2.bed \
-an QD \
-an MQ \
-an MQRankSum \
-titv 3 \
-mode SNP \
-tranche 100.0 -tranche 99.5 -tranche 99.0 -tranche 90.0 \
-recalFile XXX_recalSNP.recal \
-tranchesFile XXX.tranches \
-rscriptFile XXX_recalSNP_plots.R


Which produces

INFO 00:12:38,603 GenomeAnalysisEngine - Strictness is SILENT
INFO 00:12:38,666 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 00:12:39,111 GenomeAnalysisEngine - Preparing for traversal
INFO 00:12:39,119 GenomeAnalysisEngine - Done preparing for traversal
INFO 00:12:39,120 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 00:12:39,120 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 00:12:39,120 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 00:12:39,125 TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0
INFO 00:12:39,126 TrainingSet - Found omni track: Known = false Training = true Truth = true Prior = Q12.0
INFO 00:12:39,126 TrainingSet - Found 1000G track: Known = false Training = true Truth = false Prior = Q10.0
INFO 00:12:39,126 TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0
INFO 00:13:09,124 ProgressMeter - chr1:168282042 6525865.0 30.0 s 4.0 s 5.4% 9.3 m 8.8 m
INFO 00:13:39,126 ProgressMeter - chr2:102475430 1.3439691E7 60.0 s 4.0 s 11.2% 8.9 m 7.9 m
INFO 00:14:09,130 ProgressMeter - chr3:45857514 2.032475E7 90.0 s 4.0 s 17.2% 8.7 m 7.2 m
INFO 00:14:39,133 ProgressMeter - chr4:32353684 2.7060788E7 120.0 s 4.0 s 23.0% 8.7 m 6.7 m
INFO 00:15:09,135 ProgressMeter - chr5:37997884 3.3309406E7 2.5 m 4.0 s 29.3% 8.5 m 6.0 m
INFO 00:15:39,138 ProgressMeter - chr6:34847769 3.972357E7 3.0 m 4.0 s 35.0% 8.6 m 5.6 m
INFO 00:16:09,141 ProgressMeter - chr7:43252920 4.6076952E7 3.5 m 4.0 s 40.7% 8.6 m 5.1 m
INFO 00:16:39,144 ProgressMeter - chr8:59083943 5.2572327E7 4.0 m 4.0 s 46.3% 8.6 m 4.6 m
INFO 00:17:09,147 ProgressMeter - chr9:125329947 5.9087344E7 4.5 m 4.0 s 53.1% 8.5 m 4.0 m
INFO 00:17:39,148 ProgressMeter - chr11:20077519 6.6175312E7 5.0 m 4.0 s 58.5% 8.5 m 3.5 m
INFO 00:18:09,151 ProgressMeter - chr12:55433666 7.3219963E7 5.5 m 4.0 s 64.0% 8.6 m 3.1 m
INFO 00:18:39,154 ProgressMeter - chr14:24684974 7.9821067E7 6.0 m 4.0 s 70.9% 8.5 m 2.5 m
INFO 00:19:09,159 ProgressMeter - chr16:4650284 8.6967833E7 6.5 m 4.0 s 77.0% 8.4 m 116.0 s
INFO 00:19:39,161 ProgressMeter - chr17:77877251 9.4673949E7 7.0 m 4.0 s 82.2% 8.5 m 91.0 s
INFO 00:20:09,164 ProgressMeter - chr20:14506766 1.02436026E8 7.5 m 4.0 s 87.1% 8.6 m 66.0 s
INFO 00:20:39,165 ProgressMeter - chrX:70830524 1.0971454E8 8.0 m 4.0 s 94.1% 8.5 m 30.0 s
INFO 00:20:49,461 VariantDataManager - QD: mean = 17.71 standard deviation = 6.41
INFO 00:20:49,478 VariantDataManager - MQ: mean = 69.82 standard deviation = 1.66
INFO 00:20:49,485 VariantDataManager - MQRankSum: mean = 0.11 standard deviation = 0.69
INFO 00:20:49,494 VariantDataManager - ReadPosRankSum: mean = 0.41 standard deviation = 0.73
INFO 00:20:49,746 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, MQRankSum, ReadPosRankSum]
INFO 00:20:49,756 VariantDataManager - Training with 74924 variants after standard deviation thresholding.
INFO 00:20:49,759 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 00:20:52,290 VariantRecalibratorEngine - Finished iteration 0.
INFO 00:20:53,325 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 2.03635
INFO 00:20:54,356 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 4.81600
INFO 00:20:55,445 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.23736
INFO 00:20:56,538 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.23086
INFO 00:20:57,657 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.16611
INFO 00:20:58,814 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.02066
INFO 00:20:59,969 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.00940
INFO 00:21:01,138 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.00720
INFO 00:21:02,328 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00735
INFO 00:21:03,512 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.00786
INFO 00:21:04,665 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.01016
INFO 00:21:05,827 VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.01475
INFO 00:21:06,995 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.01589
INFO 00:21:08,166 VariantRecalibratorEngine - Finished iteration 70. Current change in mixture coefficients = 0.02197
INFO 00:21:09,168 ProgressMeter - chrUn_gl000228:32096 1.1237226E8 8.5 m 4.0 s 100.0% 8.5 m 0.0 s
INFO 00:21:09,354 VariantRecalibratorEngine - Finished iteration 75. Current change in mixture coefficients = 0.03203
INFO 00:21:10,537 VariantRecalibratorEngine - Finished iteration 80. Current change in mixture coefficients = 0.04255
INFO 00:21:11,729 VariantRecalibratorEngine - Finished iteration 85. Current change in mixture coefficients = 0.01019
INFO 00:21:12,933 VariantRecalibratorEngine - Finished iteration 90. Current change in mixture coefficients = 0.00551
INFO 00:21:14,137 VariantRecalibratorEngine - Finished iteration 95. Current change in mixture coefficients = 0.01607
INFO 00:21:15,338 VariantRecalibratorEngine - Finished iteration 100. Current change in mixture coefficients = 0.01040
INFO 00:21:16,546 VariantRecalibratorEngine - Finished iteration 105. Current change in mixture coefficients = 0.14741
INFO 00:21:17,770 VariantRecalibratorEngine - Finished iteration 110. Current change in mixture coefficients = 0.18376
INFO 00:21:18,524 VariantRecalibratorEngine - Convergence after 113 iterations!
INFO 00:21:18,651 VariantRecalibratorEngine - Evaluating full set of 98477 variants...
INFO 00:21:18,661 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 00:21:19,564 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR stack trace

java.lang.IllegalArgumentException: No data found.

##### ERROR ------------------------------------------------------------------------------------------

Would someone be able to help me with this?
Is this because I have a very small chromosomal region ~70kb over which I am trying to identify variants?
If so, is there any other tool that I can use for variant recalibration over a small targeted region?

Also, is there any way to specify my target region so that the resource flag scans only my target region in the HapMap and dbSNP databases?

Thank you so much in advance and I will be grateful for any help on this topic since I am new to GATK.

Thanks,
Neha

@Neha
Hi again,

You have a few other questions here that were not asked in the other thread, so I will point you to a few other articles that should help

-Sheila

• United States Member Posts: 1
edited September 2016

Hi,

I am processing some exome samples and am able to get VQSR (one sample at a time) to run on the majority of them (using --maxGaussians 4). Some fail with the "No data found" error reported by others in this thread. The above recommendation by Kurt to not include MQ in the mixture model does alleviate the problem, but I am left with a call set in which almost no variants are filtered. Most of these samples have a reasonably large number of variants (~400,000). I also tried the -minNumBad option, but it gave the same error. Is there another option, apart from manual filtering or adding more samples, that can solve this problem? Also, what is the recommended number of variants for VQSR to build a good model?

Thank you for your help

We recommend using at least 30 whole exome samples in VQSR. You should not be running on single exome samples.

-Sheila

• Rochester, NY 14627 Member Posts: 2

@Kurt, you pointed to the key to this type of error. In my case, I got rid of -an FS, and it worked. I think GATK should handle this itself by checking whether an annotation is informative before training a model. For now, my suggestion is to remove annotations one at a time until the program runs successfully, then add the removed ones back one by one until the error occurs again; the annotation that was just added is the culprit.
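
The remove-then-add-back procedure suggested above could be automated along these lines. This is only a sketch: `find_culprit` and `fake_runner` are hypothetical helpers, not GATK features, and the stand-in runner simply pretends that any run including FS fails. In practice the runner would be your own wrapper that launches VariantRecalibrator with one `-an` per argument and returns a non-zero exit status when the run ends in "No data found". It also assumes a single annotation is responsible for the failure.

```shell
#!/usr/bin/env bash
# Sketch of the "remove, then add back" search for a problematic annotation.

find_culprit() {
  # Usage: find_culprit <runner> <annotation>...
  # <runner> is any command that takes annotation names as arguments and
  # exits non-zero when the recalibration run fails.
  local runner="$1"; shift
  local kept=("$@") removed=() a last

  # Phase 1: drop annotations from the end until the run succeeds.
  # Runner output is sent to stderr so stdout carries only the result.
  while [ "${#kept[@]}" -gt 0 ] && ! "$runner" "${kept[@]}" >&2; do
    last=$(( ${#kept[@]} - 1 ))
    removed+=("${kept[$last]}")
    unset "kept[$last]"
  done

  # Phase 2: add them back one at a time; the first one that breaks
  # the run again is reported as the culprit.
  for a in "${removed[@]}"; do
    if "$runner" "${kept[@]}" "$a" >&2; then
      kept+=("$a")
    else
      echo "$a"
      return 0
    fi
  done
  return 1   # nothing reproduced the failure
}

# Stand-in runner for demonstration only: "fails" whenever FS is present.
fake_runner() {
  local a
  for a in "$@"; do
    if [ "$a" = "FS" ]; then return 1; fi
  done
  return 0
}

find_culprit fake_runner DP QD FS MQRankSum   # prints "FS"
```

Each call to the runner is a full recalibration run, so on real data this search can take a while; keeping the runner's log output separate for each attempt makes it easier to confirm that every failure is the same error.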

