"ERROR MESSAGE: No data found." from VariantRecalibrator

claybreshears (Hillsboro, OR; Posts: 1; Member)

This crops up when running with "-mode INDEL". Not sure why there is no data. (See attached log file with stack trace.)

All input files are non-empty (except the .R file). A similar execution using "-mode SNP" completes with no problems. Since I'm simply looking to get the scripting and flags correct, I've used a public data set. Could it be that I'm unlucky and chose something that has no indels from the reference, which is causing the error? Could there be a more graceful method of termination?

Attachment: NIST7035_TAAGGCGA_L001_R1_001.recalibrate.indel.log (8K)

Answers

  • Geraldine_VdAuwera (Posts: 9,347; Administrator, Dev admin)

    It looks like you're using a pretty small dataset, so there might be no variants in your data that overlap with the model training resources. This happens often for indels if you're running on a small dataset. The solution is to use a bigger dataset -- unfortunately it's not possible to test VQSR on small datasets.

    We're looking at ways to improve how the program handles the issues stemming from having too few variants to work with, so hopefully future versions will be more graceful.

    Geraldine Van der Auwera, PhD
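
One way to check the "no overlap with the training resources" explanation before running VQSR is to count how many indel sites in the callset also appear in the training resource. The sketch below is only an approximation under stated assumptions: the function names are hypothetical, sites are compared by chromosome and position alone, and indels are detected with a simple allele-length test. It is not how VariantRecalibrator itself selects training variants.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip-compressed VCF for reading (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def indel_sites(lines):
    """Yield (chrom, pos) for every VCF record with at least one indel allele."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        alleles = [a for a in alt.strip().split(",") if a not in (".", "*")]
        # Simple length-based test; symbolic alleles such as <DEL> are ignored.
        if any(not a.startswith("<") and len(a) != len(ref) for a in alleles):
            yield chrom, int(pos)


def count_training_overlap(callset_lines, resource_lines):
    """Return (indel sites in callset, how many of them are also in the resource)."""
    training = set(indel_sites(resource_lines))
    calls = set(indel_sites(callset_lines))
    return len(calls), len(calls & training)
```

With real files this could be called as `count_training_overlap(open_vcf(callset_path), open_vcf(resource_path))`, where both paths are placeholders. A very small second number would be consistent with the explanation given above.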

  • Irantzu (Posts: 14; Member)
    edited July 2014

    Hi @Geraldine,
    one little question. I'm running VariantRecalibrator, and it seems to run OK, but at the end I get the "##### ERROR MESSAGE: No data found." error. I think the command is OK, but is it possible to run VariantRecalibrator with 4000 variants and only ONE sample? I'm asking because I've read several comments about this issue and I'm not sure whether the analysis can be run with only one sample...

    Thanks in advance

  • Geraldine_VdAuwera (Posts: 9,347; Administrator, Dev admin)

    Hi @Irantzu,

    VQSR does not perform well (if at all) on a single sample. It can work with whole genome sequence, but if you're working with exome, there are just too few variants. Our recommendation for dealing with this is to get additional sample BAMs from the 1000 Genomes Project and add them to your callset (see this presentation for details).

    Geraldine Van der Auwera, PhD

  • quang (Oxford UK; Posts: 10; Member)

    Hi Geraldine,

    I ran VQSR on 39 million SNPs but still got the message "##### ERROR MESSAGE: No data found.".

    Could it be that VQSR cannot match the SNPs in the input to the training files because we do not include rsID information in the input file?

    Many thanks,
    Best regards,
    Quang.

  • Sheila (Broad Institute; Posts: 2,678; Member, Broadie, Moderator, Dev admin)

    @quang

    Hi Quang,

    Can you please post your command line and full log output?

    Thanks,
    Sheila

  • seru (Bergen; Posts: 35; Member ✭✭)

    Hi and Happy New Year,

    I will post my logs, as I am getting the same error. A brief background first; hopefully it can shed more light on this mysterious issue. We continuously generate exome data in batches of 8 samples (1 NextSeq run), and genotype every new batch (using HC; GATK 3.3.0) together with all exomes previously sequenced on the same platform/capture kit, so the set of joint-called samples grows gradually.

    In the beginning there were no problems with VQSR (we started from >30 exomes). When we crossed 80 exomes I got the 'No data found' exception for the first time (for SNPs). I removed the MQ annotation from SNP VQSR (we use BWA MEM) and disabled multithreading, as suggested in multiple posts on this forum. This helped.

    Now, approaching 170 exomes, I got the same problem for INDELs. VQSR for the prior run (8 exomes fewer) worked fine. Replacing the problematic run with 8 exomes sequenced on a different platform also didn't result in this exception. What could be the problem here? It can't be too little input data. When I removed the MQRankSum annotation (the least informative one) from the INDEL VQSR step, it went fine. Is the data becoming too homogeneous for the model as the samples accumulate?

    Any input is appreciated. Best regards,
    Pawel

    Here is the error, with the args used:

        INFO  19:13:34,375 HelpFormatter - --------------------------------------------------------------------------------
        INFO  19:13:34,377 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
        INFO  19:13:34,377 HelpFormatter - Copyright (c) 2010 The Broad Institute
        INFO  19:13:34,377 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
        INFO  19:13:34,380 HelpFormatter - Program Args: -T VariantRecalibrator -R /persistent/diagnostic/reference/g1k_v37/human_g1k_v37.fasta -input /scratch/diagnostics/160107_NS500635_0061_AHLYH5BGXX/160107_NS500635_0061_AHLYH5BGXX.multisample.vcf --maxGaussians 4 -resource:mills,known=false,training=true,truth=true,prior=12.0 /persistent/diagnostic/reference/g1k_v37/Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /persistent/diagnostic/reference/g1k_v37/dbsnp_138.b37.vcf -an QD -an DP -an FS -an ReadPosRankSum -an MQRankSum -mode INDEL -recalFile /scratch/diagnostics/160107_NS500635_0061_AHLYH5BGXX/160107_NS500635_0061_AHLYH5BGXX.multisample.indel.model -tranchesFile /scratch/diagnostics/160107_NS500635_0061_AHLYH5BGXX/160107_NS500635_0061_AHLYH5BGXX.multisample.indel.model.tranches -rscriptFile /scratch/diagnostics/160107_NS500635_0061_AHLYH5BGXX/160107_NS500635_0061_AHLYH5BGXX.multisample.indel.model.plots.R
        INFO  19:13:34,385 HelpFormatter - Executing as ?@d369c9a76972 on Linux 3.16.0-56-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_91-b02.
        INFO  19:13:34,386 HelpFormatter - Date/Time: 2016/01/09 19:13:34
        INFO  19:13:34,386 HelpFormatter - --------------------------------------------------------------------------------
        INFO  19:13:34,386 HelpFormatter - --------------------------------------------------------------------------------
        INFO  19:13:36,014 GenomeAnalysisEngine - Strictness is SILENT
        INFO  19:13:36,269 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
        INFO  19:13:39,358 GenomeAnalysisEngine - Preparing for traversal
        INFO  19:13:39,377 GenomeAnalysisEngine - Done preparing for traversal
        INFO  19:13:39,378 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
        INFO  19:13:39,378 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
        INFO  19:13:39,379 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
        INFO  19:13:39,389 TrainingSet - Found mills track:         Known = false   Training = true         Truth = true    Prior = Q12.0
        INFO  19:13:39,390 TrainingSet - Found dbsnp track:         Known = true    Training = false        Truth = false   Prior = Q2.0
        INFO  19:14:09,423 ProgressMeter -      1:37206397    919122.0    30.0 s      32.0 s        1.2%    41.7 m      41.2 m
        INFO  19:14:39,451 ProgressMeter -      1:83817776   1954620.0    60.0 s      30.0 s        2.7%    37.0 m      36.0 m
        INFO  19:15:09,453 ProgressMeter -     1:153472711   3014298.0    90.0 s      29.0 s        4.9%    30.3 m      28.8 m
        INFO  19:15:39,455 ProgressMeter -     1:196374572   4005249.0   120.0 s      29.0 s        6.3%    31.6 m      29.6 m
        INFO  19:16:09,457 ProgressMeter -     1:248906839   5271675.0     2.5 m      28.0 s        8.0%    31.2 m      28.7 m
        INFO  19:16:39,459 ProgressMeter -      2:61044625   6836862.0     3.0 m      26.0 s       10.0%    30.0 m      27.0 m
        INFO  19:17:09,461 ProgressMeter -     2:137506314   8502246.0     3.5 m      24.0 s       12.5%    28.1 m      24.6 m
        INFO  19:17:39,462 ProgressMeter -     2:212393706   1.0140715E7     4.0 m      23.0 s       14.9%    26.9 m      22.9 m
        INFO  19:18:09,464 ProgressMeter -      3:36534129   1.1838863E7     4.5 m      22.0 s       17.1%    26.4 m      21.9 m
        INFO  19:18:39,466 ProgressMeter -     3:108996601   1.3467792E7     5.0 m      22.0 s       19.4%    25.8 m      20.8 m
        INFO  19:19:09,467 ProgressMeter -     3:178223504   1.5068286E7     5.5 m      21.0 s       21.6%    25.4 m      19.9 m
        INFO  19:19:39,469 ProgressMeter -      4:45498849   1.6747321E7     6.0 m      21.0 s       23.7%    25.3 m      19.3 m
        INFO  19:20:09,503 ProgressMeter -     4:120858180   1.8435624E7     6.5 m      21.0 s       26.2%    24.9 m      18.4 m
        INFO  19:20:39,504 ProgressMeter -     4:191042026   2.0137218E7     7.0 m      20.0 s       28.4%    24.6 m      17.6 m
        INFO  19:21:09,506 ProgressMeter -      5:74997042   2.1830903E7     7.5 m      20.0 s       30.8%    24.3 m      16.8 m
        INFO  19:21:39,507 ProgressMeter -     5:147999622   2.3497481E7     8.0 m      20.0 s       33.2%    24.1 m      16.1 m
        INFO  19:22:09,509 ProgressMeter -      6:33400096   2.5247443E7     8.5 m      20.0 s       35.3%    24.1 m      15.6 m
        INFO  19:22:39,510 ProgressMeter -     6:108996084   2.6958351E7     9.0 m      20.0 s       37.8%    23.8 m      14.8 m
        INFO  19:23:09,512 ProgressMeter -       7:8471762   2.8733513E7     9.5 m      19.0 s       40.0%    23.7 m      14.2 m
        INFO  19:23:39,513 ProgressMeter -      7:80996573   3.0443304E7    10.0 m      19.0 s       42.4%    23.6 m      13.6 m
        INFO  19:24:09,514 ProgressMeter -     7:155865502   3.2161014E7    10.5 m      19.0 s       44.8%    23.4 m      12.9 m
        INFO  19:24:39,516 ProgressMeter -      8:60995618   3.3900585E7    11.0 m      19.0 s       46.9%    23.5 m      12.5 m
        INFO  19:25:09,517 ProgressMeter -     8:137497616   3.5630237E7    11.5 m      19.0 s       49.3%    23.3 m      11.8 m
        INFO  19:25:39,519 ProgressMeter -      9:82999224   3.7336294E7    12.0 m      19.0 s       52.3%    22.9 m      10.9 m
        INFO  19:26:09,520 ProgressMeter -     10:11798385   3.9083434E7    12.5 m      19.0 s       54.6%    22.9 m      10.4 m
        INFO  19:26:39,522 ProgressMeter -     10:86998079   4.0813704E7    13.0 m      19.0 s       57.0%    22.8 m       9.8 m
        INFO  19:27:09,523 ProgressMeter -     11:22502054   4.2574642E7    13.5 m      19.0 s       59.3%    22.8 m       9.3 m
        INFO  19:27:39,524 ProgressMeter -     11:94501691   4.4270577E7    14.0 m      18.0 s       61.6%    22.7 m       8.7 m
        INFO  19:28:09,526 ProgressMeter -     12:30468189   4.6004444E7    14.5 m      18.0 s       63.9%    22.7 m       8.2 m
        INFO  19:28:39,527 ProgressMeter -    12:106996274   4.772222E7    15.0 m      18.0 s       66.3%    22.6 m       7.6 m
        INFO  19:29:09,528 ProgressMeter -     13:63576415   4.9475787E7    15.5 m      18.0 s       69.3%    22.4 m       6.9 m
        INFO  19:29:39,530 ProgressMeter -     14:36593796   5.1148678E7    16.0 m      18.0 s       72.1%    22.2 m       6.2 m
        INFO  19:30:09,531 ProgressMeter -    14:106998390   5.2817001E7    16.5 m      18.0 s       74.4%    22.2 m       5.7 m
        INFO  19:30:39,533 ProgressMeter -     15:94586228   5.4591134E7    17.0 m      18.0 s       77.4%    22.0 m       5.0 m
        INFO  19:31:09,534 ProgressMeter -     16:69813419   5.6323579E7    17.5 m      18.0 s       79.9%    21.9 m       4.4 m
        INFO  19:31:39,536 ProgressMeter -     17:47063825   5.8105082E7    18.0 m      18.0 s       82.1%    21.9 m       3.9 m
        INFO  19:32:09,537 ProgressMeter -     18:40000044   5.9843335E7    18.5 m      18.0 s       84.5%    21.9 m       3.4 m
        INFO  19:32:39,538 ProgressMeter -     19:30653797   6.1586296E7    19.0 m      18.0 s       86.7%    21.9 m       2.9 m
        INFO  19:33:09,540 ProgressMeter -     20:39996130   6.3311329E7    19.5 m      18.0 s       88.9%    21.9 m       2.4 m
        INFO  19:33:39,541 ProgressMeter -     22:22114820   6.5077829E7    20.0 m      18.0 s       91.9%    21.8 m     105.0 s
        INFO  19:34:09,543 ProgressMeter -      X:64999443   6.6840196E7    20.5 m      18.0 s       95.0%    21.6 m      65.0 s
        INFO  19:34:39,437 VariantDataManager - QD:          mean = 16.75    standard deviation = 7.85
        INFO  19:34:39,442 VariantDataManager - DP:          mean = 8812.68  standard deviation = 5706.10
        INFO  19:34:39,447 VariantDataManager - FS:          mean = 2.61     standard deviation = 7.96
        INFO  19:34:39,452 VariantDataManager - ReadPosRankSum:      mean = 0.12     standard deviation = 0.72
        INFO  19:34:39,458 VariantDataManager - MQRankSum:   mean = 0.24     standard deviation = 0.82
        INFO  19:34:39,544 ProgressMeter -        MT:16311   6.8555979E7    21.0 m      18.0 s       99.8%    21.0 m       2.0 s
        INFO  19:34:39,552 VariantDataManager - Annotations are now ordered by their information content: [DP, QD, FS, ReadPosRankSum, MQRankSum]
        INFO  19:34:39,553 VariantDataManager - Training with 2982 variants after standard deviation thresholding.
        INFO  19:34:39,557 GaussianMixtureModel - Initializing model with 100 k-means iterations...
        INFO  19:34:39,808 VariantRecalibratorEngine - Finished iteration 0.
        INFO  19:34:40,024 VariantRecalibratorEngine - Finished iteration 5.        Current change in mixture coefficients = 0.47109
        INFO  19:34:40,071 VariantRecalibratorEngine - Finished iteration 10.       Current change in mixture coefficients = 0.06015
        INFO  19:34:40,107 VariantRecalibratorEngine - Finished iteration 15.       Current change in mixture coefficients = 0.02086
        INFO  19:34:40,138 VariantRecalibratorEngine - Finished iteration 20.       Current change in mixture coefficients = 0.01137
        INFO  19:34:40,169 VariantRecalibratorEngine - Finished iteration 25.       Current change in mixture coefficients = 0.00532
        INFO  19:34:40,181 VariantRecalibratorEngine - Convergence after 27 iterations!
        INFO  19:34:40,212 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.
        INFO  19:35:09,546 ProgressMeter -        MT:16311   6.8555979E7    21.5 m      18.0 s       99.8%    21.5 m       2.0 s
        ##### ERROR ------------------------------------------------------------------------------------------
        ##### ERROR stack trace
        java.lang.IllegalArgumentException: No data found.
            at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
            at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:399)
            at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:143)
            at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
            at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
            at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
            at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
            at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
            at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
            at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
        ##### ERROR ------------------------------------------------------------------------------------------
        ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
        ##### ERROR
        ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
        ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
        ##### ERROR Visit our website and forum for extensive documentation and answers to
        ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
        ##### ERROR
        ##### ERROR MESSAGE: No data found.
        ##### ERROR ------------------------------------------------------------------------------------------
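
The log above reports "Training with worst 0 scoring variants" right after the first model converges and just before the exception. This suggests that the step that failed had an empty set of low-scoring variants to train on. That reading is inferred from the log and is not confirmed anywhere in this thread. As a minimal sketch, a pipeline could scan the VariantRecalibrator log for these messages so the failure is easier to diagnose. The function name and the returned dictionary keys are hypothetical, and the patterns are taken only from the log lines shown here (GATK 3.3).

```python
import re

# Patterns copied from the GATK 3.3 log lines in this thread.
TRAINING_RE = re.compile(
    r"Training with (\d+) variants after standard deviation thresholding"
)
WORST_RE = re.compile(r"Training with worst (\d+) scoring variants")


def summarize_vqsr_log(lines):
    """Collect training-set sizes and the 'No data found' flag from a log."""
    summary = {
        "positive_training": None,  # variants used after std-dev thresholding
        "negative_training": None,  # size of the 'worst scoring' set
        "no_data_error": False,
    }
    for line in lines:
        match = TRAINING_RE.search(line)
        if match:
            summary["positive_training"] = int(match.group(1))
        match = WORST_RE.search(line)
        if match:
            summary["negative_training"] = int(match.group(1))
        if "ERROR MESSAGE: No data found." in line:
            summary["no_data_error"] = True
    return summary
```

Applied to the log in this post, the summary would show 2982 variants after thresholding, 0 worst-scoring variants, and the error flag set.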
    
  • Geraldine_VdAuwera (Posts: 9,347; Administrator, Dev admin)

    Interesting -- could you please run VQSR again with GATK 3.5? There were some changes made to how GATK handles some annotations including MQ. I think those changes should help with this type of problem so I'd be curious to know if you see an improvement.

    Geraldine Van der Auwera, PhD

  • seru (Bergen; Posts: 35; Member ✭✭)

    Thank you for the prompt reply. We use GATK as part of a production system/pipeline, and for consistency and stability we would like to avoid changing software versions as much as possible. Migrating to 3.5 is not as easy as replacing the jar, and will require some more testing. I could test whether it runs just for the sake of checking, but unless we are really forced to, we will most likely not upgrade at the moment.

    Are there any other possibilities?

  • Geraldine_VdAuwera (Posts: 9,347; Administrator, Dev admin)

    I completely understand that upgrading your production pipeline may not be immediately possible -- but it would be helpful to know if the changes address your problem or not. Depending on the answer our recommendations for dealing with the problem at hand may be different.

    Geraldine Van der Auwera, PhD

  • seru (Bergen; Posts: 35; Member ✭✭)

    OK, I will give it a try and get back to you when I have more information. Cheers, Paweł
