VQSR failed with "No data found" with Haloplex panel variant calls.

ishikuraishikura japanMember
edited November 2017 in Ask the GATK team

Hi,

Could you please help me understand the behavior of GATK 3.8?
I ran VariantRecalibrator with the GRCh37d5 reference and it failed with "No data found".
I have three datasets, and two of them finished successfully with the -mG 2 option.
Is this a good approach?
The third dataset still fails with "No data found" in INDEL mode, even with the -mG option.

This is my pipeline.

    ${GATKCommand} -T SelectVariants \
        -R ${Refpath}/genome.fa \
        -V ${Outpath}/${Sample}.vcf \
        -selectType SNP \
        -o ${Outpath}/${Sample}.s.vcf

    ${GATKCommand} -T SelectVariants \
        -R ${Refpath}/genome.fa \
        -V ${Outpath}/${Sample}.vcf \
        -selectType INDEL \
        -o ${Outpath}/${Sample}.i.vcf

    ${GATKCommand} -T VariantRecalibrator \
        -R ${Refpath}/genome.fa \
        -input ${Outpath}/${Sample}.s.vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${Refpath}/hapmap_3.3.b37.vcf \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 ${Refpath}/1000G_omni2.5.b37.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 ${Refpath}/1000G_phase1.snps.high_confidence.b37.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 ${Refpath}/dbsnp146.vcf \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
        -mG 2 \
        -mode SNP \
        -recalFile ${Outpath}/${Sample}.s.recal \
        -tranchesFile ${Outpath}/${Sample}.s.tranches \
        -rscriptFile ${Outpath}/${Sample}.s.plots.R

    ${GATKCommand} -T ApplyRecalibration \
        -R ${Refpath}/genome.fa \
        -input ${Outpath}/${Sample}.s.vcf \
        -tranchesFile ${Outpath}/${Sample}.s.tranches \
        -recalFile ${Outpath}/${Sample}.s.recal \
        --ts_filter_level 99.0 \
        -mode SNP \
        -o ${Outpath}/${Sample}.s.recal.vcf

    ${GATKCommand} -T VariantRecalibrator \
        -R ${Refpath}/genome.fa \
        -input ${Outpath}/${Sample}.i.vcf \
        -resource:mills,known=true,training=true,truth=true,prior=12.0 ${Refpath}/Mills_and_1000G_gold_standard.indels.b37.vcf \
        -an QD -an DP -an FS -an SOR -an MQRankSum -an ReadPosRankSum \
        -mG 2 \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.04 \
        -mode INDEL \
        -recalFile ${Outpath}/${Sample}.i.recal \
        -tranchesFile ${Outpath}/${Sample}.i.tranches \
        -rscriptFile ${Outpath}/${Sample}.i.plots.R

    ${GATKCommand} -T ApplyRecalibration \
        -R ${Refpath}/genome.fa \
        -input ${Outpath}/${Sample}.i.vcf \
        -tranchesFile ${Outpath}/${Sample}.i.tranches \
        -recalFile ${Outpath}/${Sample}.i.recal \
        --ts_filter_level 99.0 \
        -mode INDEL \
        -o ${Outpath}/${Sample}.i.recal.vcf

The error message is below.

    ##### ERROR --
    ##### ERROR stack trace
    java.lang.IllegalArgumentException: No data found.
            at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
            at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:536)
            at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:191)
            at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
            at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:115)
            at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
            at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
            at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
            at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
            at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: No data found.
    ##### ERROR ------------------------------------------------------------------------------------------

Thank you in advance.

Ishi

Answers

  • ishikuraishikura japanMember

    Hi Sheila,

    Thank you for your reply.
    Yes, I had already checked some threads.
    But we can't merge the VCF files because of the current pipeline structure.
    In that case, should we use hard filtering with VariantFiltration instead of VariantRecalibrator?

    Thanks,
    Takashi

  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited November 2017

    Hi @ishikura,

    Sheila is traveling for a workshop. I am not familiar with VQSR's options, so I'll have to consult with a developer. Before I do, can you clarify what you mean by merging VCF files? That is, can you explain your pipeline to us so we can understand what the limitation is? For example, is your pipeline scattering across contigs, and is it that these VCFs cannot be consolidated?

    In GATK4, there is a tool, GatherTranches, that "gathers scattered VQSLOD tranches into a single file."

    The tool doc is at https://software.broadinstitute.org/gatk/gatkdocs/4.beta.6/org_broadinstitute_hellbender_tools_walkers_vqsr_GatherTranches.php.

    Also, can you please tell us more about your experimental design?

    I have three datasets, and two of them finished successfully with the -mG 2 option.
    Is this a good approach?
    The third dataset still fails with "No data found" in INDEL mode, even with the -mG option.

    Do you have WGS or exome data? Are you referring to data files or samples in your comment? Thanks.

  • ishikuraishikura japanMember

    Hi Shlee,

    Thank you for your comment.
    The pipeline is not my own; it is one my customer used previously.
    It does not produce a consolidated VCF; it maps reads and calls variants
    for each sample one by one.

    I will try using a consolidated VCF made with GATK CombineVariants and
    report back here.

    I also have some exome data, and it sometimes fails in the same way, but less
    frequently than the Haloplex panel data.
    There is no experimental design, because these scripts are for a core lab
    and the analyses are done in batches, one by one.

    Thanks,
    Takashi

  • ishikuraishikura japanMember

    I tried consolidating the three VCF files into one VCF,
    but VariantRecalibrator still fails with the same "No data found" error in INDEL mode.

    Thanks,
    Takashi
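The post does not show the command used for the consolidation. A minimal sketch of how per-sample VCFs could be merged with GATK 3.8's CombineVariants follows. It assumes the `GATKCommand` and `Refpath` variables from the pipeline in the original question; the function name `combine_vcfs` and the file names are hypothetical.

```shell
# Sketch only: GATKCommand and Refpath are assumed to be set as in the
# original pipeline, e.g. GATKCommand="java -jar GenomeAnalysisTK.jar".

# usage: combine_vcfs merged.vcf sample1.vcf sample2.vcf ...
combine_vcfs() {
    merged_out=$1
    shift
    # Rewrite the argument list so each input is preceded by --variant.
    for vcf in "$@"; do
        shift
        set -- "$@" --variant "$vcf"
    done
    # UNIQUIFY keeps the genotypes from every input file even when
    # sample names are shared between files.
    ${GATKCommand} -T CombineVariants \
        -R "${Refpath}/genome.fa" \
        "$@" \
        -genotypeMergeOptions UNIQUIFY \
        -o "${merged_out}"
}
```

Merging a handful of per-sample VCFs this way does not by itself give VariantRecalibrator more training sites, which may be why the error persisted.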

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    @ishikura, since it sounds like your VCF may be unconventionally formatted, let's see what ValidateVariants has to say about your VCF. Can you let us know?

  • ishikuraishikura japanMember
    edited November 2017

    Hi shlee,

    Thank you for your comment.
    I ran ValidateVariants with no options and got no errors.
    But when I ran ValidateVariants in strict mode, an rsID-not-found error occurred. Could this be causing the problem in VQSR?

        java -Xmx16g -jar /data/tools/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar \
            -T ValidateVariants \
            -R /data/ref/GRCh37d5/genome.fa \
            -V merged.vcf \
            --dbsnp /data/ref/GRCh37d5/archive/All_20170710.vcf.gz

        ##### ERROR MESSAGE: File /data/handai/Haloplex/vcf38merge/merged.vcf fails strict validation: the rsID rs111480478 for the record at position 1:143277670 is not in dbSNP

    Thanks,
    Takashi

  • ishikuraishikura japanMember

    Hi shlee,

    Thank you for your response.
    I understand that I can ignore the rsID error.

    My error occurs in VariantRecalibrator, so it comes from a lack of data.
    I will discuss the pipeline structure with my team.
    Also, thank you for the valuable link on hard filtering.

    Regards,
    Takashi
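To get a rough sense of how little data VariantRecalibrator has to work with on a small panel, one could count the records in the indel VCF and how many of them sit at sites that also appear in the training resource. The sketch below is only an approximation: it matches on chromosome and position, which is simpler than whatever matching VQSR does internally, and it expects uncompressed VCFs. The function names are hypothetical.

```shell
# Count the non-header, non-empty records in an uncompressed VCF.
# usage: count_vcf_records calls.vcf
count_vcf_records() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Count records in a call set whose CHROM and POS also occur in a
# resource VCF (for example the Mills indel set used for training).
# usage: count_site_overlap calls.vcf resource.vcf
count_site_overlap() {
    awk -F'\t' '
        /^#/ { next }                                     # skip header lines
        FILENAME == ARGV[1] { seen[$1 FS $2] = 1; next }  # first file: resource sites
        (($1 FS $2) in seen) { n++ }                      # second file: call set
        END { print n + 0 }
    ' "$2" "$1"
}
```

For example, `count_site_overlap sample.i.vcf Mills_and_1000G_gold_standard.indels.b37.vcf` (file names as in the pipeline above) returning zero or only a few sites would be consistent with the "No data found" error in INDEL mode.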

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ishikura
    Hi Takashi,

    Yes, in your case, it sounds like you have 3 VCFs that contain variants from exomes. We recommend using at least 30 exomes in VQSR, as the tool needs to see a lot of data to make good models. You will be better off using hard filtering. The docs Soo Hee pointed you to should help.

    Good luck.

    -Sheila
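As a rough illustration of the hard-filtering route suggested above, the sketch below swaps the VariantRecalibrator and ApplyRecalibration steps for GATK 3.8's VariantFiltration. It assumes the `GATKCommand` and `Refpath` variables from the original pipeline. The thresholds are the generic starting points from the GATK hard-filtering recommendations, not values tuned for Haloplex data, and the function and filter names are hypothetical.

```shell
# Sketch only: GATKCommand and Refpath are assumed to be set as in the
# original pipeline, e.g. GATKCommand="java -jar GenomeAnalysisTK.jar".

# usage: hard_filter_snps in.s.vcf out.s.filtered.vcf
hard_filter_snps() {
    ${GATKCommand} -T VariantFiltration \
        -R "${Refpath}/genome.fa" \
        -V "$1" \
        --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
        --filterName "snp_hard_filter" \
        -o "$2"
}

# usage: hard_filter_indels in.i.vcf out.i.filtered.vcf
hard_filter_indels() {
    ${GATKCommand} -T VariantFiltration \
        -R "${Refpath}/genome.fa" \
        -V "$1" \
        --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
        --filterName "indel_hard_filter" \
        -o "$2"
}
```

Records that match an expression are kept in the output and tagged with the filter name in the FILTER column, so they can be reviewed before being discarded.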
