"ERROR MESSAGE: No data found." from VariantRecalibrator

claybreshears (Hillsboro, OR; Posts: 1; Member)

This crops up when running with "-mode INDEL". Not sure why there is no data. (See attached log file with stack trace.)

All input files are non-empty (except the .R file). A similar execution using "-mode SNP" completes with no problems. Since I'm simply looking to get the scripting and flags correct, I've used a public data set. Could it be that I'm unlucky and chose something that has no indels from the reference, which is causing the error? Could there be a more graceful method of termination?



  • Geraldine_VdAuwera (Posts: 8,748; Administrator, GATK Dev admin)

    It looks like you're using a pretty small dataset, so there might be no variants in your data that overlap with the model training resources. This happens often for indels if you're running on a small dataset. The solution is to use a bigger dataset -- unfortunately it's not possible to test VQSR on small datasets.

    We're looking at ways to improve how the program handles the issues stemming from having too few variants to work with, so hopefully future versions will be more graceful.

    Geraldine Van der Auwera, PhD
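One way to check the explanation above before rerunning is to count how many indel sites in the callset also appear in the training resource. The following is only a rough sketch under stated assumptions: the function names (`is_indel`, `sites`, `count_overlap`, `open_vcf`) are hypothetical, it reads plain tab-delimited VCF text, and it matches sites by chromosome and position only, which is a simplification and not necessarily how VariantRecalibrator itself pairs input variants with training sites.

```python
import gzip
from typing import Iterable, Iterator, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


def is_indel(ref: str, alt: str) -> bool:
    """True if any ALT allele differs in length from REF.

    Missing ('.'), spanning-deletion ('*') and symbolic ('<...>') alleles
    are ignored. This is a simplified classification for illustration.
    """
    return any(
        len(allele) != len(ref)
        for allele in alt.split(",")
        if allele not in (".", "*") and not allele.startswith("<")
    )


def sites(lines: Iterable[str], indels: bool) -> Iterator[Site]:
    """Yield (chrom, pos) for indel records (or non-indel records if indels=False)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if is_indel(ref, alt) == indels:
            yield chrom, int(pos)


def count_overlap(callset_lines: Iterable[str],
                  resource_lines: Iterable[str],
                  indels: bool = True) -> int:
    """Count distinct callset sites that also occur in the resource."""
    resource_sites = set(sites(resource_lines, indels))
    return len(set(sites(callset_lines, indels)) & resource_sites)


def open_vcf(path: str):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)
```

With real files, this could be called as `count_overlap(open_vcf(callset_path), open_vcf(resource_path))`, where both paths are placeholders for your own data. A count of zero or only a handful of sites would be consistent with the dataset being too small for the indel model, though it does not by itself prove that this is the cause of the error.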

  • Irantzu (Posts: 14; Member)
    edited July 2014

    Hi @Geraldine,
    One quick question. I'm running VariantRecalibrator, and it seems to run OK, but at the end I get the "##### ERROR MESSAGE: No data found." error. I think the command is correct, but is it possible to run VariantRecalibrator with 4000 variants and only ONE sample? I'm asking because I've read several comments about this issue, and I'm not sure whether the analysis can be run with only one sample.

    Thanks in advance

  • Geraldine_VdAuwera (Posts: 8,748; Administrator, GATK Dev admin)

    Hi @Irantzu,

    VQSR does not perform well (if at all) on a single sample. It can work with whole-genome sequence, but if you're working with exome data, there are just too few variants. Our recommendation for dealing with this is to get additional sample BAMs from the 1000 Genomes Project and add them to your callset (see this presentation for details).

    Geraldine Van der Auwera, PhD

  • quang (Oxford UK; Posts: 10; Member)

    Hi Geraldine,

    I ran VQSR on 39 million SNPs but still got the message "##### ERROR MESSAGE: No data found.".

    Could it be that VQSR cannot match the SNPs in the input to the training files because we do not include rsID information in the input file?

    Many thanks,
    Best regards,

  • Sheila (Broad Institute; Posts: 2,398; Member, GATK Dev, Broadie, Moderator, DSDE Dev admin)


    Hi Quang,

    Can you please post your command line and full log output?

