gatk VariantRecalibrator used to work: adding more samples returns "No data found"

gatk VariantRecalibrator (v3.7) used to work on my WGS data (about 600 samples). I've added 100 more samples (via .gvcf) to my data set, and now I get a "No data found" error.

I'm just surprised that I get this message after adding more samples. How can I get this error when I'm 'just' adding more data to train on?

INFO  09:35:00,234 VariantDataManager - QD:      mean = 19.63    standard deviation = 4.90 
INFO  09:35:00,336 VariantDataManager - FS:      mean = 1.61     standard deviation = 3.23 
INFO  09:35:00,421 VariantDataManager - SOR:     mean = 0.73     standard deviation = 0.27 
INFO  09:35:00,485 VariantDataManager - MQ:      mean = 38.37    standard deviation = 19.78 
INFO  09:35:00,549 VariantDataManager - MQRankSum:   mean = 0.02     standard deviation = 0.40 
INFO  09:35:00,612 VariantDataManager - ReadPosRankSum:      mean = 0.32     standard deviation = 0.53 
INFO  09:35:00,675 VariantDataManager - InbreedingCoeff:     mean = 0.01     standard deviation = 0.07 
INFO  09:35:01,391 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, ReadPosRankSum, SOR, MQRankSum, InbreedingCoeff] 
INFO  09:35:01,425 VariantDataManager - Training with 386960 variants after standard deviation thresholding. 
INFO  09:35:01,429 GaussianMixtureModel - Initializing model with 100 k-means iterations... 
INFO  09:35:15,123 ProgressMeter -  chr18:78016669   2800393.0     8.5 m       3.0 m       99.9%     8.5 m       0.0 s 

INFO  09:39:15,324 ProgressMeter -  chr18:78016669   2800393.0    12.5 m       4.5 m       99.9%    12.5 m       0.0 s 
INFO  09:39:17,150 VariantRecalibratorEngine - Finished iteration 125.  Current change in mixture coefficients = 0.00239 
INFO  09:39:22,734 VariantRecalibratorEngine - Convergence after 128 iterations! 
INFO  09:39:23,872 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000. 
##### ERROR --
##### ERROR stack trace 
java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions
##### ERROR
##### ERROR MESSAGE: No data found.
##### ERROR ------------------------------------------------------------------------------------------
#executed by process    1 in 764.869s   with status 0 : JETER/targets.bash 5

Best Answer


  • SkyWarrior (Turkey, Member)

    Can you try reducing the number of Gaussians to something less than 8? Preferably 4.
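
    For context, a sketch of what that change looks like in GATK 3.7 syntax. This is not the poster's actual command -- the reference, input, resource, and output paths are placeholders; only --maxGaussians and the listed annotations come from the thread:

    ```shell
    # Hypothetical GATK 3.7 invocation; all file paths are placeholders.
    # --maxGaussians 4 caps the number of clusters the mixture model may fit.
    java -jar GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R ref.fasta \
        -input cohort.vcf.gz \
        -resource:training,known=false,training=true,truth=true,prior=10.0 training.vcf \
        -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff \
        -mode SNP \
        --maxGaussians 4 \
        -recalFile cohort.snps.recal \
        -tranchesFile cohort.snps.tranches
    ```

    With fewer Gaussians, each cluster absorbs more variants, which makes it more likely that some variants still fall far enough from every cluster to populate the "bad" training set.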

  • lindenb (France, Member)

    @SkyWarrior I will. But I especially want to understand how this can happen: more samples/variants -> error.

  • SkyWarrior (Turkey, Member)
    edited November 2017

    Yep, this is an interesting finding. I had a similar issue in the past: running VariantRecalibrator on a low-quality set with few samples was successful, but on a high-quality set with many samples it failed, and I got away with reducing the Gaussians to 4. I also wonder why that happens. (The whole recalibration was good with Gaussians equal to 4, though. This was also INDEL mode, as far as I remember.) I got the tip from the how-tos.


    'This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include very many variants.'

    Could it be that adding more samples reduces the number of clusters because the variant annotations regress toward the mean?

    Issue · Github
    by Sheila

  • Sheila (Broad Institute, Broadie, Moderator)

    @lindenb @SkyWarrior

    I have seen this reported in other threads, and indeed the solution was to reduce maxGaussians. However, I will ask Geraldine to comment more here.


  • init_js (Member)
    edited July 18

    @lindenb, did reducing maxGaussians provide a workaround?

    I'm trying to remedy the same error on 4.1.2. In my case, I've had it work fine with 350 input samples (plant, whole-genome), but it fails to find bad variants when I use callsets of around 600 samples (also whole-genome). One possibility is that adding more samples widens the range of possible qualities, so that suddenly no variant stands out as being really bad (?). My recalibration of indels passes; I get the error only for SNPs.

    My model is created from a hard-filtered subset (20 samples) of all SNPs and indels (known=false, training=true, truth=true, prior=Q10.0), and I recalibrate the full dataset against that.
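
    A sketch of what that setup might look like in GATK 4.1.2 syntax -- the paths, the "hardfiltered" resource label, and the annotation list are placeholders inferred from the log below, not the poster's actual command; only the resource attributes (known=false, training=true, truth=true, prior=10.0) come from the comment above:

    ```shell
    # Hypothetical GATK 4.1.2 invocation; file paths and resource label are guesses.
    # The hard-filtered subset serves as both training and truth resource.
    gatk VariantRecalibrator \
        -R ref.fasta \
        -V cohort.vcf.gz \
        --resource:hardfiltered,known=false,training=true,truth=true,prior=10.0 hardfiltered.snps.vcf.gz \
        -an DP -an MQ -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
        -mode SNP \
        -O cohort.snps.recal \
        --tranches-file cohort.snps.tranches
    ```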

    I have many variants to pick from:

    23:59:44.085 INFO  ProgressMeter - Traversal complete. Processed 167628980 total variants in 30.1 minutes.
    23:59:51.359 INFO  VariantDataManager - QD:          mean = 24.44    standard deviation = 6.32
    00:00:00.493 INFO  VariantDataManager - MQ:          mean = 53.78    standard deviation = 8.33
    00:00:09.192 INFO  VariantDataManager - MQRankSum:   mean = -0.01    standard deviation = 0.22
    00:00:19.572 INFO  VariantDataManager - ReadPosRankSum:      mean = 0.09     standard deviation = 0.42
    00:00:29.046 INFO  VariantDataManager - FS:          mean = 3.23     standard deviation = 5.16
    00:00:35.900 INFO  VariantDataManager - SOR:         mean = 0.72     standard deviation = 0.33
    00:00:42.700 INFO  VariantDataManager - DP:          mean = 3219.48  standard deviation = 638.81
    00:03:07.356 INFO  VariantDataManager - Annotations are now ordered by their information content: [DP, MQ, QD, FS, SOR, ReadPosRankSum, MQRankSum]
    00:03:10.907 INFO  VariantDataManager - Training with 3814814 variants after standard deviation thresholding.
    00:03:10.907 WARN  VariantDataManager - WARNING: Very large training set detected. Downsampling to 2500000 training variants.
    00:03:11.080 INFO  GaussianMixtureModel - Initializing model with 100 k-means iterations...
    00:06:35.044 INFO  VariantRecalibratorEngine - Finished iteration 0.
    00:21:00.257 INFO  VariantRecalibratorEngine - Finished iteration 60.       Current change in mixture coefficients = 0.00260
    00:21:59.511 INFO  VariantRecalibratorEngine - Convergence after 64 iterations!
    00:22:09.227 INFO  VariantRecalibratorEngine - Evaluating full set of 122545233 variants...

    Other suggestions
    I've collected a few other suggestions by sifting the forum. The general response to this issue is that "one needs at least 30 exomes or 1 whole genome" or that "not enough variants have been provided". That is the issue, but only in a general sense -- counter-intuitively, larger sample sizes can shrink the set of variants eligible to be marked as bad.

    1. GATK 3.7, thread 10798: VQSR should work with as little as 1 whole genome.
    2. GATK 4.0.5, thread 11110: discrepancy between the coverage of the training resource and the training dataset.
    3. From a 2014 comment in thread 3952: try taking out -an MQ. Others report removing all annotations and re-adding them until it starts working. This may or may not have been fixed in GATK4.
    4. Adjust --bad-lod-score-cutoff as an argument to VariantRecalibrator. -5.0 is the default, so a higher value might admit some bad variants into the model.
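
    The two most-cited knobs above can be combined in one rerun. A hedged sketch in GATK4 syntax (all paths and the resource line are placeholders; --max-gaussians and --bad-lod-score-cutoff are the GATK4 spellings of the arguments discussed in this thread, and the cutoff value -2.0 is just an illustrative choice):

    ```shell
    # Hypothetical GATK4 rerun combining both suggestions; paths are placeholders.
    # --max-gaussians limits model complexity (suggestion from SkyWarrior);
    # --bad-lod-score-cutoff raises the LOD bar so the "bad" set is non-empty
    # (default -5.0; anything the model scores at or below the cutoff qualifies).
    gatk VariantRecalibrator \
        -R ref.fasta \
        -V cohort.vcf.gz \
        --resource:training,known=false,training=true,truth=true,prior=10.0 training.vcf.gz \
        -an QD -an FS -an SOR -an MQRankSum -an ReadPosRankSum \
        -mode SNP \
        --max-gaussians 4 \
        --bad-lod-score-cutoff -2.0 \
        -O cohort.snps.recal \
        --tranches-file cohort.snps.tranches
    ```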

    What else can we try?
