# Stack trace error with GATK

Hi, I'm currently running GATK VariantRecalibrator on a number of scaffolds from a plant genome assembly with only a few hundred KBs. We're only looking at a few scaffolds because they contain genes of interest, plus resources are limited to run a full RNA-seq analysis on the entire 4.4Gb genome assembly. I've taken the highest quality score snps and indels as my true variants, as suggested in the VQSR online documentation.

I've been getting the following trace stack error:

INFO 14:51:26,349 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]

INFO 14:51:26,350 ProgressMeter - | processed | time | per 1M | | total | remaining

INFO 14:51:26,350 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime

INFO 14:51:26,376 TrainingSet - Found highconf track: Known = false Training = true Truth = true Prior = Q1365.0

INFO 14:51:26,377 TrainingSet - Found nonetrue track: Known = false Training = true Truth = false Prior = Q10.0

INFO 14:51:26,377 TrainingSet - Found allsnp track: Known = true Training = false Truth = false Prior = Q2.0

INFO 14:51:27,655 VariantDataManager - AB: mean = 0.21 standard deviation = 0.23

INFO 14:51:27,657 VariantDataManager - ABP: mean = 413.89 standard deviation = 2754.96

INFO 14:51:27,658 VariantDataManager - AC: mean = 1.44 standard deviation = 0.50

INFO 14:51:27,659 VariantDataManager - AF: mean = 0.72 standard deviation = 0.25

INFO 14:51:27,661 VariantDataManager - AO: mean = 256.71 standard deviation = 1368.91

INFO 14:51:27,662 VariantDataManager - DP: mean = 786.62 standard deviation = 4502.47

INFO 14:51:27,663 VariantDataManager - DPB: mean = 786.62 standard deviation = 4502.47

INFO 14:51:27,664 VariantDataManager - EPP: mean = 60.76 standard deviation = 666.25

INFO 14:51:27,666 VariantDataManager - EPPR: mean = 33.16 standard deviation = 282.27

INFO 14:51:27,667 VariantDataManager - MEANALT: mean = 1.34 standard deviation = 0.66

INFO 14:51:27,668 VariantDataManager - MQM: mean = 52.80 standard deviation = 9.93

INFO 14:51:27,670 VariantDataManager - MQMR: mean = 30.90 standard deviation = 26.49

INFO 14:51:27,671 VariantDataManager - NUMALT: mean = 1.01 standard deviation = 0.11

INFO 14:51:27,671 VariantDataManager - ODDS: mean = 1236.30 standard deviation = 8150.78

INFO 14:51:27,672 VariantDataManager - PAIRED: mean = 0.72 standard deviation = 0.35

INFO 14:51:27,673 VariantDataManager - PAIREDR: mean = 0.40 standard deviation = 0.41

INFO 14:51:27,673 VariantDataManager - QA: mean = 17225.55 standard deviation = 92064.01

INFO 14:51:27,674 VariantDataManager - QR: mean = 32825.42 standard deviation = 209629.68

INFO 14:51:27,675 VariantDataManager - RO: mean = 485.64 standard deviation = 3103.30

INFO 14:51:27,675 VariantDataManager - RPL: mean = 129.77 standard deviation = 776.53

INFO 14:51:27,676 VariantDataManager - RPP: mean = 164.92 standard deviation = 1625.97

INFO 14:51:27,677 VariantDataManager - RPPR: mean = 125.85 standard deviation = 1058.25

INFO 14:51:27,678 VariantDataManager - RPR: mean = 126.94 standard deviation = 840.44

INFO 14:51:27,679 VariantDataManager - SAF: mean = 135.54 standard deviation = 860.84

INFO 14:51:27,679 VariantDataManager - SAP: mean = 134.77 standard deviation = 1213.51

INFO 14:51:27,680 VariantDataManager - SAR: mean = 121.17 standard deviation = 686.94

INFO 14:51:27,681 VariantDataManager - SRF: mean = 241.44 standard deviation = 1608.40

INFO 14:51:27,682 VariantDataManager - SRP: mean = 179.43 standard deviation = 1473.39

INFO 14:51:27,682 VariantDataManager - SRR: mean = 244.20 standard deviation = 1675.14

INFO 14:51:27,708 VariantDataManager - Annotations are now ordered by their information content: [QR, QA, ODDS, DP, DPB, RO, ABP, AO, SRR, SRF, SRP, RPP, SAF, SAP, RPL, RPR, RPPR, SAR, EPP, MQM, EPPR, MQMR, MEANALT, AC, NUMALT, AF, PAIRED, AB, PAIREDR]

INFO 14:51:27,708 VariantDataManager - Training with 1403 variants after standard deviation thresholding.

INFO 14:51:27,713 GaussianMixtureModel - Initializing model with 100 k-means iterations...

INFO 14:51:27,929 VariantRecalibratorEngine - Finished iteration 0.

##### ERROR ------------------------------------------------------------------------------------------

##### ERROR stack trace

java.lang.StackOverflowError

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

at org.apache.commons.math.special.Gamma.digamma(Gamma.java:310)

Thank you for any help you can provide.

Cheers

Brett

## Answers

@bchapman I think your problem is that your training set is too small.

@bchapman

Hi Brett,

I think Tommy is correct. How many samples are in your dataset? We recommend running VQSR with at least 30 whole exome samples or 1 whole genome sample.

-Sheila

Holy hannah, that's a lot of annotations. My guess is that there's too many dimensions, hence the stack overflow in the math function. You may not even be getting to the point in the program where having too few datapoints causes problems, afaict. Though that is clearly going to be too few variants.

Thanks for the feedback. I'll try reducing the number of annotations used, but which ones should I choose and how many? I don't know what each annotation means.

In terms of sample size, the scaffolds from the assembly were selected because they contain regions of interest, and running the whole genome was not an option. The service provider who supplies a super computer with enough memory only sets a walltime of 24 hours, so any RNA-seq alignments I do would have to complete within that time. Which is ridiculous for bioinformatics.

@bchapman

Hi,

You can learn about the GATK annotations here: https://www.broadinstitute.org/gatk/guide/tooldocs/

-Sheila

Thanks for pointing me to the list of annotations. From the list I chose 3 annotations which were in my VCF files.

I am now getting this error. I'm not sure why it is complaining about no data, because the paths to the files are all correct. I'm thinking its a different error, after perhaps an error occurred when uploading to aws. I don't know why it reports to aws, is there any reason it should connect to aws?

Thanks.

INFO 11:28:54,619 HelpFormatter - ---------------------------------------------------------------------------------

INFO 11:28:54,622 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12

INFO 11:28:54,622 HelpFormatter - Copyright (c) 2010 The Broad Institute

INFO 11:28:54,622 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk

INFO 11:28:54,627 HelpFormatter - Program Args: -T VariantRecalibrator -R /brldata/ccgstaff/bchapman/Barley/variantCalls/reference/FlowerScaffolds_inEC1-S2.2-3085_fullcovered.fasta -input ../EC17.2_mergedFlower_mutant_fullcov_unified.sorted.vcf -resource:highconf,known=false,training=true,truth=true,prior=1365.0 /brldata/ccgstaff/bchapman/Barley/variantCalls/snp_highConfFiltered.recode.sorted.vcf -resource:nonetrue,known=false,training=true,truth=false,prior=10.0 /brldata/ccgstaff/bchapman/Barley/variantCalls/snp_NonTrue.recode.sorted.vcf -resource:allsnp,known=true,training=false,truth=false,prior=2.0 /brldata/ccgstaff/bchapman/Barley/variantCalls/snp_All.recode.sorted.vcf -an AB -an AF -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile /brldata/ccgstaff/bchapman/Barley/variantCalls/run/output/recalibrate_SNP.recal -tranchesFile /brldata/ccgstaff/bchapman/Barley/variantCalls/run/output/recalibrate_SNP.tranches -rscriptFile /brldata/ccgstaff/bchapman/Barley/variantCalls/run/output/recalibrate_SNP_plots.R

INFO 11:28:54,631 HelpFormatter - Executing as ccgstaff@syd-cluster-prod-azb-trusty-node001 on Linux 3.13.0-34-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13.

INFO 11:28:54,632 HelpFormatter - Date/Time: 2015/09/07 11:28:54

INFO 11:28:54,632 HelpFormatter - ---------------------------------------------------------------------------------

INFO 11:28:54,632 HelpFormatter - ---------------------------------------------------------------------------------

INFO 11:28:55,208 GenomeAnalysisEngine - Strictness is SILENT

INFO 11:28:55,317 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000

INFO 11:28:55,537 GenomeAnalysisEngine - Preparing for traversal

INFO 11:28:55,539 GenomeAnalysisEngine - Done preparing for traversal

INFO 11:28:55,540 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]

INFO 11:28:55,540 ProgressMeter - | processed | time | per 1M | | total | remaining

INFO 11:28:55,541 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime

INFO 11:28:55,548 TrainingSet - Found highconf track: Known = false Training = true Truth = true Prior = Q1365.0

INFO 11:28:55,549 TrainingSet - Found nonetrue track: Known = false Training = true Truth = false Prior = Q10.0

INFO 11:28:55,549 TrainingSet - Found allsnp track: Known = true Training = false Truth = false Prior = Q2.0

INFO 11:28:56,591 VariantDataManager - AB: mean = 0.21 standard deviation = 0.23

INFO 11:28:56,593 VariantDataManager - AF: mean = 0.72 standard deviation = 0.25

INFO 11:28:56,594 VariantDataManager - DP: mean = 786.62 standard deviation = 4502.47

INFO 11:28:56,608 VariantDataManager - Annotations are now ordered by their information content: [DP, AF, AB]

INFO 11:28:56,609 VariantDataManager - Training with 1412 variants after standard deviation thresholding.

INFO 11:28:56,613 GaussianMixtureModel - Initializing model with 100 k-means iterations...

INFO 11:28:56,764 VariantRecalibratorEngine - Finished iteration 0.

INFO 11:28:56,842 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.39657

INFO 11:28:56,871 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.10571

INFO 11:28:56,896 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.11083

INFO 11:28:56,922 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.09283

INFO 11:28:56,946 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.04922

INFO 11:28:56,970 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.04880

INFO 11:28:56,995 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.01370

INFO 11:28:57,018 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.00341

INFO 11:28:57,038 VariantRecalibratorEngine - Convergence after 44 iterations!

INFO 11:28:57,055 VariantRecalibratorEngine - Evaluating full set of 1995 variants...

INFO 11:28:57,056 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

INFO 11:28:58,852 GATKRunReport - Uploaded run statistics report to AWS S3

## ERROR ------------------------------------------------------------------------------------------

## ERROR stack trace

java.lang.IllegalArgumentException: No data found.

at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)

at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:408)

at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:156)

at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)

at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)

at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)

at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)

at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)

at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)

at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

## ERROR ------------------------------------------------------------------------------------------

@bchapman

Hi Brett,

Yes, the issue is that you do not have enough variants for VQSR to work properly. The best thing to do is use hard filtering. http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set

You may need to tweak the thresholds to better fit your data.

-Sheila