gatk VariantRecalibrator used to work: adding more samples returns "No data found"

gatk VariantRecalibrator (v3.7) used to work on my WGS data (about 600 samples). I've added 100 more samples (via .gvcf) to my data set and now I get a "No data found" error.

I'm just surprised to get this message after adding more samples. How can this error occur when I'm 'just' adding more data to train on?

INFO  09:35:00,234 VariantDataManager - QD:      mean = 19.63    standard deviation = 4.90 
INFO  09:35:00,336 VariantDataManager - FS:      mean = 1.61     standard deviation = 3.23 
INFO  09:35:00,421 VariantDataManager - SOR:     mean = 0.73     standard deviation = 0.27 
INFO  09:35:00,485 VariantDataManager - MQ:      mean = 38.37    standard deviation = 19.78 
INFO  09:35:00,549 VariantDataManager - MQRankSum:   mean = 0.02     standard deviation = 0.40 
INFO  09:35:00,612 VariantDataManager - ReadPosRankSum:      mean = 0.32     standard deviation = 0.53 
INFO  09:35:00,675 VariantDataManager - InbreedingCoeff:     mean = 0.01     standard deviation = 0.07 
INFO  09:35:01,391 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, ReadPosRankSum, SOR, MQRankSum, InbreedingCoeff] 
INFO  09:35:01,425 VariantDataManager - Training with 386960 variants after standard deviation thresholding. 
INFO  09:35:01,429 GaussianMixtureModel - Initializing model with 100 k-means iterations... 
INFO  09:35:15,123 ProgressMeter -  chr18:78016669   2800393.0     8.5 m       3.0 m       99.9%     8.5 m       0.0 s 

INFO  09:39:15,324 ProgressMeter -  chr18:78016669   2800393.0    12.5 m       4.5 m       99.9%    12.5 m       0.0 s 
INFO  09:39:17,150 VariantRecalibratorEngine - Finished iteration 125.  Current change in mixture coefficients = 0.00239 
INFO  09:39:22,734 VariantRecalibratorEngine - Convergence after 128 iterations! 
INFO  09:39:23,872 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000. 
##### ERROR --
##### ERROR stack trace 
java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions
##### ERROR
##### ERROR MESSAGE: No data found.
##### ERROR ------------------------------------------------------------------------------------------
#executed by process    1 in 764.869s   with status 0 : JETER/targets.bash 5
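A plausible reading of the log: the line "Training with worst 0 scoring variants --> variants with LOD <= -5.0000" shows the negative (bad-variant) training set came out empty, which is exactly the kind of condition that would surface as "No data found". A toy sketch of that guard (plain Python, not GATK code; names are illustrative):

```python
def build_negative_training_set(lods, bad_lod_cutoff=-5.0):
    """Collect the worst-scoring variants used to train the 'bad' model.

    Mirrors the log line 'Training with worst 0 scoring variants -->
    variants with LOD <= -5.0000': if no variant scores at or below the
    cutoff, there is nothing to train the negative model on.
    """
    bad = [lod for lod in lods if lod <= bad_lod_cutoff]
    if not bad:
        # the condition that would surface as "No data found"
        raise ValueError("No data found.")
    return bad
```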

Best Answer


  • SkyWarrior (Turkey, Member)

    Can you try reducing the number of Gaussians to something less than 8? Preferably 4.
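    For reference, a sketch of what such a run might look like (GATK 3.x syntax; the annotations match the log above, but the file names are placeholders and the `-resource:` training-set arguments are omitted — keep whatever the original command used):

    ```
    java -jar GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R reference.fasta \
        -input cohort.vcf.gz \
        -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff \
        -mode SNP \
        --maxGaussians 4 \
        -recalFile cohort.recal \
        -tranchesFile cohort.tranches
    ```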

  • lindenb (France, Member)

    @SkyWarrior I will. But I especially want to understand how this can happen: more samples/variants -> error.

  • SkyWarrior (Turkey, Member)
    edited November 2017

    Yep, this is an interesting finding. I had a similar issue in the past: running VariantRecalibrator on a low-quality set with few samples was successful, but on a high-quality set with many samples it failed, and I could get around it by reducing the Gaussians to 4. I also wonder why that happens. (The whole recalibration was good with Gaussians equal to 4, though. This was also INDEL mode, as far as I remember.) I got the tip from the how-to's.


    'This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include very many variants.'

    Could it be that adding more samples reduces the number of clusters, because the variant annotations regress toward the mean?
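    One way to picture how more Gaussians can empty the "worst" set (a toy sketch in plain Python, not GATK's actual model): with enough mixture components the positive model fits every variant well, so nothing falls below the bad-LOD cutoff; with fewer components the fit is coarser and some variants score badly again.

    ```python
    import math

    def log_norm(x, mu, sigma):
        """Log-density of a normal distribution."""
        return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

    def mixture_lod(x, components):
        """Log-density under an equal-weight Gaussian mixture (a toy 'LOD')."""
        return math.log(sum(math.exp(log_norm(x, mu, s)) for mu, s in components)
                        / len(components))

    # Two tight clusters of 'variants' in one annotation dimension.
    points = [-0.2, 0.0, 0.3, 9.8, 10.0, 10.1]

    many = [(0.0, 0.3), (10.0, 0.3)]   # enough components: everything fits well
    few = [(5.0, 0.3)]                 # too few components: a coarse fit

    bad_with_many = [x for x in points if mixture_lod(x, many) <= -5.0]
    bad_with_few = [x for x in points if mixture_lod(x, few) <= -5.0]
    # bad_with_many is empty -> nothing to train a negative model on;
    # bad_with_few is non-empty -> negative-model training can proceed.
    ```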

    Issue · Github
    by Sheila

  • Sheila (Broad Institute; Broadie, Moderator)

    @lindenb @SkyWarrior

    I have seen this reported in other threads, and indeed the solution was to reduce maxGaussians. However, I will ask Geraldine to comment further here.

