VariantRecalibrator removing all SNPs from the training set

Hi,

I've recently updated my GATK pipeline for population-level variant calling of dog WGS data to take advantage of the new GenotypeGVCFs module instead of relying on UnifiedGenotyper or HaplotypeCaller to handle dozens of samples simultaneously. Unfortunately, now when I try to recalibrate my variants using the truth set of ~150K SNPs I was using before, the training eventually converges by excluding all the training variants, leading to a crash. At first I was running the command multithreaded, but this also happens when it's run as a single thread (see stack trace below). Any ideas what's going on?

Thanks,
Adam

INFO 16:40:52,233 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:40:52,236 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 16:40:52,236 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:40:52,237 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:40:52,241 HelpFormatter - Program Args: -T VariantRecalibrator -R canFam3.1.fa -input raw.debug.vcf -resource:illumina,known=false,training=true,truth=true,prior=12.0 sites.cf3.1.vcf -an DP -an QD -an MQRankSum -an FS -an ReadPosRankSum -an MQ -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 -recalFile snp.debug.recal -tranchesFile snp.debug.tranches -rscriptFile snp.debug.plots.R
INFO 16:40:52,740 HelpFormatter - Executing on Linux 2.6.32-358.14.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13.
INFO 16:40:52,740 HelpFormatter - Date/Time: 2014/06/02 16:40:52
INFO 16:40:52,741 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:40:52,741 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:40:53,837 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:40:54,288 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 16:40:55,647 GenomeAnalysisEngine - Preparing for traversal
INFO 16:40:55,686 GenomeAnalysisEngine - Done preparing for traversal
INFO 16:40:55,704 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 16:40:55,722 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 16:40:56,264 TrainingSet - Found illumina track: Known = false Training = true Truth = true Prior = Q12.0
INFO 16:41:26,324 ProgressMeter - chr1:119694011 1.33e+06 30.0 s 23.0 s 5.0% 10.1 m 9.6 m
INFO 16:41:57,528 ProgressMeter - chr10:24041205 1.65e+06 61.0 s 37.0 s 6.1% 16.7 m 15.7 m
INFO 16:42:37,192 ProgressMeter - chr12:3177469 3.13e+06 101.0 s 32.0 s 11.2% 15.1 m 13.4 m
INFO 16:43:09,795 ProgressMeter - chr14:46556735 5.37e+06 2.2 m 24.0 s 18.6% 12.0 m 9.8 m
INFO 16:43:42,333 ProgressMeter - chr17:46123438 7.70e+06 2.8 m 21.0 s 26.2% 10.5 m 7.8 m
INFO 16:44:20,667 ProgressMeter - chr2:56532172 1.00e+07 3.4 m 20.0 s 33.9% 10.0 m 6.6 m
INFO 16:44:51,269 ProgressMeter - chr22:26635764 1.21e+07 3.9 m 19.0 s 40.7% 9.6 m 5.7 m
INFO 16:45:22,151 ProgressMeter - chr25:8747205 1.38e+07 4.4 m 19.0 s 46.6% 9.5 m 5.1 m
INFO 16:45:52,154 ProgressMeter - chr27:41179493 1.56e+07 4.9 m 18.0 s 51.7% 9.5 m 4.6 m
INFO 16:46:22,532 ProgressMeter - chr3:13905602 1.69e+07 5.4 m 19.0 s 55.9% 9.7 m 4.3 m
INFO 16:46:52,534 ProgressMeter - chr3:77994124 1.76e+07 5.9 m 20.0 s 58.6% 10.1 m 4.2 m
INFO 16:47:22,536 ProgressMeter - chr30:9518519 1.80e+07 6.4 m 21.0 s 59.6% 10.8 m 4.4 m
INFO 16:47:52,538 ProgressMeter - chr30:36929683 1.83e+07 6.9 m 22.0 s 60.7% 11.4 m 4.5 m
INFO 16:48:24,345 ProgressMeter - chr31:13298903 1.86e+07 7.5 m 24.0 s 61.4% 12.2 m 4.7 m
INFO 16:48:54,346 ProgressMeter - chr31:38887091 1.89e+07 8.0 m 25.0 s 62.4% 12.8 m 4.8 m
INFO 16:49:24,348 ProgressMeter - chr32:16826952 1.92e+07 8.5 m 26.0 s 63.2% 13.4 m 4.9 m
INFO 16:49:54,350 ProgressMeter - chr33:9665516 1.97e+07 9.0 m 27.0 s 64.5% 13.9 m 4.9 m
INFO 16:50:24,352 ProgressMeter - chr34:8616301 2.01e+07 9.5 m 28.0 s 65.8% 14.4 m 4.9 m
INFO 16:50:54,354 ProgressMeter - chr34:41058797 2.06e+07 10.0 m 29.0 s 67.1% 14.9 m 4.9 m
INFO 16:51:26,984 ProgressMeter - chr35:20891681 2.09e+07 10.5 m 30.0 s 68.0% 15.5 m 4.9 m
INFO 16:51:58,253 ProgressMeter - chr38:14260224 2.20e+07 11.0 m 30.0 s 71.4% 15.5 m 4.4 m
INFO 16:52:32,120 ProgressMeter - chr4:78859835 2.31e+07 11.6 m 30.0 s 75.1% 15.5 m 3.9 m
INFO 16:53:02,719 ProgressMeter - chr6:24630819 2.46e+07 12.1 m 29.0 s 80.2% 15.1 m 3.0 m
INFO 16:53:32,858 ProgressMeter - chr7:70954178 2.60e+07 12.6 m 29.0 s 85.3% 14.8 m 2.2 m
INFO 16:54:11,822 ProgressMeter - chr9:39193117 2.76e+07 13.3 m 28.0 s 90.4% 14.7 m 84.0 s
INFO 16:54:43,302 ProgressMeter - chrUn_JH373262:194062 2.91e+07 13.8 m 28.0 s 97.3% 14.2 m 22.0 s
INFO 16:55:17,832 ProgressMeter - chrUn_AAEX03023557:310 2.98e+07 14.4 m 28.0 s 99.2% 14.5 m 6.0 s
INFO 16:55:23,341 VariantDataManager - DP: mean = 581.76 standard deviation = 110.45
INFO 16:55:26,420 VariantDataManager - QD: mean = 19.65 standard deviation = 5.17
INFO 16:55:29,554 VariantDataManager - MQRankSum: mean = 0.15 standard deviation = 0.38
INFO 16:55:32,879 VariantDataManager - FS: mean = 3.29 standard deviation = 3.73
INFO 16:55:36,112 VariantDataManager - ReadPosRankSum: mean = 0.25 standard deviation = 0.38
INFO 16:55:39,447 VariantDataManager - MQ: mean = 59.98 standard deviation = 0.43
INFO 16:56:34,985 VariantDataManager - Annotations are now ordered by their information content: [DP, MQ, QD, FS, MQRankSum, ReadPosRankSum]
INFO 16:56:35,732 VariantDataManager - Training with 157123 variants after standard deviation thresholding.
INFO 16:56:35,740 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 16:57:20,532 VariantRecalibratorEngine - Finished iteration 0.
INFO 16:57:32,908 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 1.28438
INFO 16:57:45,208 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.74161
INFO 16:57:57,279 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 7.59772
INFO 16:58:09,478 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.28910
INFO 16:58:22,278 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.05857
INFO 16:58:34,804 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.05605
INFO 16:58:47,116 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.05180
INFO 16:58:59,115 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.04196
INFO 16:59:11,329 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.03361
INFO 16:59:23,682 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.02521
INFO 16:59:38,689 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.02037
INFO 16:59:51,346 VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.01612
INFO 17:00:03,973 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.01269
INFO 17:00:16,765 VariantRecalibratorEngine - Finished iteration 70. Current change in mixture coefficients = 0.00867
INFO 17:00:29,690 VariantRecalibratorEngine - Finished iteration 75. Current change in mixture coefficients = 0.00473
INFO 17:00:42,394 VariantRecalibratorEngine - Finished iteration 80. Current change in mixture coefficients = 0.00248
INFO 17:00:47,807 VariantRecalibratorEngine - Convergence after 82 iterations!
INFO 17:00:50,993 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
INFO 17:01:12,880 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Adam,

    That's definitely weird. Can you tell me what exact commands you used for calling and genotyping?

  • Thanks, Geraldine. Below are the commands I used for the GVCF genotyping and variant recalibration (minus ~80 --variant lines; yes, I realize I should make a list of GVCF files as pass that as a single command line argument). From what I can tell, the GVCF genotyping step ran fine (the resulting raw vcf file looks reasonable in terms of number of SNPs/indels, sample heterozygosity, etc).

    Adam

    java -Xmx32g -Djava.io.tmpdir=$wd -jar /programs/bin/GATK/GenomeAnalysisTK.jar \
    -nt 12 -T GenotypeGVCFs -R canFam3.1.fa \
    --dbsnp sites.cf3.1.vcf \
    --variant 7294.gvcf \
    --variant 7295.gvcf \
    --variant 7296.gvcf \
    --variant 7297.gvcf \
    ....
    --variant 7555.gvcf \
    -o raw.debug.vcf

    followed by:
    java -Xmx32g -Djava.io.tmpdir=$wd -jar /programs/bin/GATK/GenomeAnalysisTK.jar \
    -T VariantRecalibrator -R canFam3.1.fa \
    -input raw.debug.vcf \
    -resource:illumina,known=false,training=true,truth=true,prior=12.0 sites.cf3.1.vcf \
    -an DP -an QD -an MQRankSum -an FS -an ReadPosRankSum -an MQ \
    -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 \
    -recalFile snp.debug.recal -tranchesFile snp.debug.tranches -rscriptFile snp.debug.plots.R

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Adam,

    Thanks for the info -- still not sure what's wrong though. Have you tried using only a subset of the gvcf files? I'd like to take a look at this locally if you don't mind sharing the files with us. I'd need your reference, a couple of the gvcfs, and the sites file you use as resource (including all index files please). Instructions for uploading are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

  • Thanks, Geraldine. I've run a few tests and can confirm that no matter what subset of gVCF files I use, the downstream VariantRecalibrator step fails. I just uploaded the reference, the raw vcf file for one of the runs, and the sites file (along with the command and the stack output for that run) to your ftp site in the folder gatk_dog_debug1. Since the error is occurring at the VariantRecalibrator step, I didn't include in the gvcf files that were used in the GenotypeGVCFs step, but can certainly send those too if they are needed. Any help is much appreciated!

    Thanks,
    Adam

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Thanks Adam, we'll have a look at these and let you know what we find. FYI it may take a few days before we can process this because we have a lot going on right now.

  • Hi Geraldine,

    Any luck with the files? I'm hoping it's just a minor bug on my end that's preventing VariantRecalibrator from working properly, but I haven't managed to figure it out.

    Thanks,
    Adam

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @seq2discover I have an idea but haven't had time to follow up on it yet, sorry. I'm hoping to get to it later today.

  • Thanks, Geraldine. Much appreciated!

  • Hi Geraldine,

    Do you think it's a problem with how I created the gvcf files, or some problem with the options I'm passing to the VariantRecalibrator?

    Thanks,
    Adam

  • Ok, it looks like it was the MQ filter that was causing the headache. I removed it and it seems to work now.

  • tommycarstensentommycarstensen United KingdomMember
    edited November 2014

    I have the same problem with VariantRecalibrator3.3:

    INFO  09:26:13,853 VariantDataManager - Training with 398412 variants after standard deviation thresholding. 
    INFO  09:32:37,426 VariantRecalibratorEngine - Evaluating full set of 972995 variants... 
    INFO  09:50:36,007 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
    ....
    ##### ERROR MESSAGE: No data found.
    

    I'll look into a solution later today. I'm running in SNP mode with recommended options; e.g. --maxGaussians 8 (default). I am running with -nt 1. I will try to remove the flag entirely to see if that makes a difference for some weird reason. I will try to run without -an MQ and/or --maxGaussians 4 as described in this thread:

    http://gatkforums.broadinstitute.org/discussion/3952/variantrecalibrator-no-data-found
    
  • tommycarstensentommycarstensen United KingdomMember

    Yes, changing --maxGaussians from the default 8 to 4 worked for me in SNP mode. I'm not sure why that is. I would like to understand it.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    VR is a sensitive beast -- when you ask it to try to cluster the data into 8 clusters, but it doesn't find a good way to produce that amount of resolution, it just flips out and drops everything*. Obviously, better behavior would be for it to go "Sorry, I can't find 8 clusters that make sense to me in this pile of data, but I have 4 or 5 that I think you might like"...

    *Not the official technical description of what the code is doing.

Sign In or Register to comment.