java.lang.ArrayIndexOutOfBoundsException with HaplotypeCaller 3.2.2

LinaHR, Uppsala (Member)

Dear Sir/Madam,

I am running HaplotypeCaller (GATK 3.2.2) and get the error message below for some of the samples. I have seen this error posted on the forum before, and the recommendation has often been to switch to a newer version of GATK. The problem is that we have already used version 3.2.2 for over 2000 samples in the same project, and those will be analysed together, so I am reluctant to switch versions for just a few samples.

I would greatly appreciate any help!

Best, Lina

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 125
at org.broadinstitute.gatk.utils.sam.AlignmentUtils.calcNumHighQualitySoftClips(AlignmentUtils.java:437)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceModel.calcGenotypeLikelihoodsOfRefVsAny(ReferenceConfidenceModel.java:291)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isActive(HaplotypeCaller.java:839)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.addIsActiveResult(TraverseActiveRegions.java:618)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.access$800(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$ActiveRegionIterator.hasNext(TraverseActiveRegions.java:378)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: 125
ERROR ------------------------------------------------------------------------------------------

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @LinaHR
    Hi Lina,

    It looks like there are too many soft clips in your reads. You can try adding -dontUseSoftClippedBases. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--dontUseSoftClippedBases

    -Sheila

  • LinaHR, Uppsala (Member)

    Dear Sheila,

    Thank you for your answer and suggestion, I will try that.

    Best, Lina

  • LinaHR, Uppsala (Member)

    Dear Sheila,

    I tried what you suggested: I added the flag --dontUseSoftClippedBases and reran the analysis. Unfortunately I am getting the same error, see below. Do you have any other suggestions as to what might be wrong? I have checked the HS metrics and flagstat for the failing samples and they look OK, but I might have missed something.

    Thank you!

    Best, Lina

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.ArrayIndexOutOfBoundsException: 125
    at org.broadinstitute.gatk.utils.sam.AlignmentUtils.calcNumHighQualitySoftClips(AlignmentUtils.java:437)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceModel.calcGenotypeLikelihoodsOfRefVsAny(ReferenceConfidenceModel.java:291)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isActive(HaplotypeCaller.java:839)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.addIsActiveResult(TraverseActiveRegions.java:618)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.access$800(TraverseActiveRegions.java:78)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$ActiveRegionIterator.hasNext(TraverseActiveRegions.java:378)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: 125
    ERROR ------------------------------------------------------------------------------------------
  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @LinaHR
    Hi Lina,

    Can you post the exact command you are running? Also, please try running ValidateSamFile on your input BAM file. http://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile

    Thanks,
    Sheila

  • LinaHR, Uppsala (Member)

    Hi!

    Thank you again! I ran Picard's ValidateSamFile on the two BAM files that are causing errors and got "No errors found" for both of them. The HaplotypeCaller command I ran is pasted below. I have also attached the complete error output from this command for one of the BAM files.

    Thank you!

    Best, Lina

    java -Xmx7g -jar /sw/apps/bioinfo/GATK/3.2.2/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa \
    -I input.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dontUseSoftClippedBases \
    -o output.g.vcf -rf BadCigar

  • LinaHR, Uppsala (Member)

    Hi again,

    A little follow-up: the HaplotypeCaller run generates a g.vcf output file, but it does not seem to be complete. I also ran ValidateVariants using the command below. The error output from this command is attached.

    Best, Lina

    java -Xmx7g -jar /sw/apps/bioinfo/GATK/3.2.2/GenomeAnalysisTK.jar -R reference.fa \
    -T ValidateVariants --validationTypeToExclude ALL --variant filename.g.vcf

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @LinaHR
    Hi Lina,

    Oh, I see. Some of the reference alleles are too long at some positions (due to long indels). Can you try running with the latest version of GATK? I think the latest version handles longer reference alleles.

    Thanks,
    Sheila

    Issue · Github (by Geraldine_VdAuwera): issue number 260, state closed, closed by vdauwera
  • LinaHR, Uppsala (Member)

    Thank you Sheila!

    As I mentioned above, I have already analysed around 2000 samples with version 3.2.2, and I do not have the option to rerun all samples with the latest version. The samples will be analysed together in an association analysis. What problems can be caused by using different versions of HaplotypeCaller?

    Thank you again!

    Best, Lina

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @LinaHR
    Hi Lina,

    Did you try running with the latest version? Do you still get the error? It is best to stick with the same version of HaplotypeCaller for all your samples, because there may have been major changes made in other versions. Have you tried to find the read(s) that are causing the error?

    -Sheila

  • LinaHR, Uppsala (Member)

    Hi!

    I will try running with the latest version as soon as our compute cluster is up again after maintenance. How do I go about trying to find the reads that are causing the error? Can I find clues in the error files from HaplotypeCaller or ValidateVariants?

    Thank you!

    Cheers, Lina

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    You can find clues in the progress report that is output to the terminal. It tells you how many positions have been processed so far at regular intervals, so you should be able to gauge how far the program got before the error occurred. That will be a rough estimate, but it will allow you to estimate what region to test first. Then when you run on that region, if there's no error then it means you have the wrong region, and you should try either the region before or after it. Once you find a region that throws the error, you narrow it down by e.g. bisecting the interval progressively, and eventually you should be able to identify a very short interval that reproduces the error.
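
The bisection described above could be scripted roughly as follows. This is only a sketch under stated assumptions, not an official GATK procedure: the function names `fails_with_gatk` and `bisect_failing_interval`, the jar and file paths, and the temporary output name `bisect_tmp.g.vcf` are all hypothetical, and the sketch assumes that GATK exits with a non-zero status when the runtime error occurs. The only GATK feature it relies on beyond the command already posted in this thread is the `-L contig:start-end` interval argument.

```python
import subprocess
from typing import Callable, Optional, Tuple

# (contig, start, end), 1-based inclusive coordinates
Interval = Tuple[str, int, int]


def fails_with_gatk(interval: Interval) -> bool:
    """Run HaplotypeCaller restricted to one interval.

    Returns True if the run exits with a non-zero status (assumed here
    to indicate the runtime error). Paths are placeholders.
    """
    contig, start, end = interval
    cmd = [
        "java", "-Xmx7g", "-jar", "GenomeAnalysisTK.jar",
        "-T", "HaplotypeCaller",
        "-R", "reference.fa",
        "-I", "input.bam",
        "--emitRefConfidence", "GVCF",
        "--variant_index_type", "LINEAR",
        "--variant_index_parameter", "128000",
        "-L", f"{contig}:{start}-{end}",
        "-o", "bisect_tmp.g.vcf",
    ]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode != 0


def bisect_failing_interval(
    interval: Interval,
    fails: Callable[[Interval], bool],
    min_size: int = 1000,
) -> Optional[Interval]:
    """Halve a failing interval until it is at most min_size bases long.

    Returns None if the starting interval does not reproduce the error.
    Stops early if neither half fails on its own, which can happen when
    the error depends on reads spanning the midpoint.
    """
    if not fails(interval):
        return None
    contig, start, end = interval
    while end - start + 1 > min_size:
        mid = (start + end) // 2
        if fails((contig, start, mid)):
            end = mid
        elif fails((contig, mid + 1, end)):
            start = mid + 1
        else:
            break
    return (contig, start, end)
```

One possible use would be to take the last position printed in the progress report as a rough guide, call `bisect_failing_interval(("chr1", 1, 1000000), fails_with_gatk)` on a region around it (the contig name and coordinates here are made up), and then inspect the reads in the returned interval.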

  • LinaHR, Uppsala (Member)

    Hi!

    Thanks a lot for your answer! I also tried running version 3.4-46 of GATK, both with and without the flag --dontUseSoftClippedBases, and both runs seem to have worked fine. Depending on how many samples are causing errors, I will decide how much effort to put into it. If it is only those 2 samples (out of 1300 for the specific cohort), I might leave them out.

    Thanks again for all your help!

    Best, Lina
