To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

VariantRecalibrator error: Code exception

nb82nb82 United StatesMember

Hi,
I am running GATK on rice libraries and I get the following error with VariantRecalibrator. The command I use is:

java -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T VariantRecalibrator -R genome.fasta -input chr5.vcf -mode SNP -recalFile chr5.raw.snps.recal -tranchesFile chr5.raw.snps.tranches -rscriptFile chr5.recal.plots.R -resource:dbSNP,known=true,training=true,truth=true,prior=6.0 all_chromosomes.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ

This is how the program terminates:

INFO 21:42:35,099 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 21:42:35,102 TrainingSet - Found dbSNP track: Known = true Training = true Truth = true Prior = Q6.0

INFO 21:43:05,104 ProgressMeter - Chr11:11654383 9.83e+06 30.0 s 3.0 s 87.7% 34.0 s 4.0 s
INFO 21:43:07,413 VariantDataManager - QD: mean = 27.02 standard deviation = 7.70
INFO 21:43:07,440 VariantDataManager - MQRankSum: mean = -0.62 standard deviation = 3.35
INFO 21:43:07,479 VariantDataManager - ReadPosRankSum: mean = 0.25 standard deviation = 1.63
INFO 21:43:07,510 VariantDataManager - FS: mean = 2.11 standard deviation = 9.67
INFO 21:43:07,525 VariantDataManager - MQ: mean = 47.09 standard deviation = 11.34
INFO 21:43:07,660 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, MQRankSum, ReadPosRankSum]
INFO 21:43:07,680 VariantDataManager - Training with 7845 variants after standard deviation thresholding.

INFO 21:43:09,833 VariantRecalibratorEngine - Convergence after 93 iterations!
INFO 21:43:09,872 VariantRecalibratorEngine - Evaluating full set of 272372 variants...
INFO 21:43:09,884 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 21:43:10,854 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:359)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:139)
at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.8-1-g932cd3a):
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

I have read similar problems on GATK forums and based on those, it seems to me that the training set of VCFs is too small for my data. Is that so? If so, can you please tell me how can I fix it? This is for rice and I only have 1 set of known VCFs from dbSNP.

Can you also please confirm if my settings for 'known', 'training' and 'truth' are correct.

Thanks a lot in advance.

Best Answer

Answers

  • nb82nb82 United StatesMember

    Thanks Geraldine,
    So, is 7845 the total number of variants in my .vcf file? Is that the problem?
    I have found the following documentation on hard-filtering. I will apply this to my set.
    However, can you please direct me to any document that explains the implications of using hard-filtering as opposed to Variant Recalibration?

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    That's correct. We don't have a document that compares VQSR and hard-filtering systematically, but basically the problem with hard-filtering is that you don't get all the benefits of VQSR (which are documented in the VQSR docs and presentations). Hard filtering forces you to choose specific thresholds for annotation values that will be applied to all variants, whereas VQSR calculates a ranking score for the variants which takes into account different possible combinations of values. Does that make sense?

  • nb82nb82 United StatesMember

    Hi Geraldine, I went back to my files and saw that the chr5.vcf file is actually 53MB and has more than 330K variants and there are more than 10M in the training file. Can you please tell me how did the program arrive at 7845 variants? Also, can you please tell me what does it say about my data that 300K variants get reduced to 7845 variants?

    Thanks,
    Nitin

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Oh, hmm. How many variants are there in your dbsnp file? The model is trained with variants that overlap between the resource (here, dbsnp) and your callset. Maybe there's not enough overlap.

  • nb82nb82 United StatesMember

    There are a total of >10M variants in the dbSNP file out of which 884K are on chr5 which is what I am analyzing here. chr5.vcf, which is my call set, has more than 300K variants belonging to only to Chr5.
    Also, is it a problem if my dbSNP has SNPs from the entire genome where as my call set has SNPs only from chr5?
    Thanks for the help, I am struggling to get this to work.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I'm a little surprised there would be not so much overlap between your chr5 and the dbsnp, but that could explain it. I would definitely recommend running the whole genome (or exome) rather than just one chromosome through VQSR. As algorithms go it's very greedy for data.

Sign In or Register to comment.