
"java.lang.ArrayIndexOutOfBoundsException" in GenotypeGVCFs when multi-threading

Fer, Austria (Member)
edited December 2014 in Ask the GATK team

Hi, I'm using GATK v3.3 to run HC followed by CombineGVCFs in six batches of ~190 individuals each, and finally GenotypeGVCFs on these six combined gVCFs. My failed command is below, followed by the error message. In addition, my output file (1135g_.vcf) was empty, even though the run only crashed after more than two hours.

Kindly, Alberto

java -Djava.io.tmpdir=$mytmp -Xmx232g -jar $EBROOTGATK/GenomeAnalysisTK.jar -R $ref -T GenotypeGVCFs -nt 40 -o 1135g_.vcf -V 1135g_listof.list

ERROR ------------------------------------------------------------------------------------------

ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: 10000
at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.generatePL(ReferenceConfidenceVariantContextMerger.java:357)
at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.mergeRefConfidenceGenotypes(ReferenceConfidenceVariantContextMerger.java:331)
at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:134)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:200)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:119)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: 10000
ERROR ------------------------------------------------------------------------------------------


Issue · Github
by Geraldine_VdAuwera

Issue Number: 856
State: closed
Closed By: vdauwera


Answers

  • Fer, Austria (Member)

    As suggested by @Geraldine_VdAuwera, I'm running it without multi-threading (i.e. without -nt 40), and it seems to be going well and writing data to the files so far. However, the speed is too slow (<1 Megabase/hr). I split by chromosome, but it would still be nice to make use of -nt.
    G'day

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    @Fer, you may be able to strike a compromise by still using -nt but at a lower value. GATK tools, unlike pillow cases, tend to fare better with lower thread counts.

  • Fer, Austria (Member)

    Dear @Geraldine_VdAuwera. Perhaps I've sent a premature update. The problem persists even after removing -nt.

    As I mentioned, I split the run by chromosome, and the error occurred in one of them after processing ~20% of it. The error message is almost identical, but I'm pasting it below because it shows some differences towards the end. Could it be an error in an input file? I don't see anything unusual in the logs of the CombineGVCFs step for any of the batches.

    Kind regards.

    ERROR stack trace

    java.lang.ArrayIndexOutOfBoundsException: 10000
    at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.generatePL(ReferenceConfidenceVariantContextMerger.java:357)
    at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.mergeRefConfidenceGenotypes(ReferenceConfidenceVariantContextMerger.java:331)
    at org.broadinstitute.gatk.utils.variant.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:134)
    at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:200)
    at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:119)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    The differences in the stack trace are due to the removal of the -nt argument, which leads to a slightly different internal path through the engine, but it looks like the error occurs in the same place in terms of processing operations.

    Were the GVCFs also generated with version 3.3 or do they originate from an older version?

    Also, can you narrow down the error to a specific slice of your dataset?

  • Fer, Austria (Member)

    Hi again, I'm almost completely sure I've re-run everything under GATK v3.3. The previous error using -nt seems to narrow down to the same area of the same chromosome as the error without that parameter. However, I don't see anything obviously wrong in the intermediate files (I mean the output of the CombineGVCFs step).

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hmm. If you run GenotypeGVCFs separately on each combined GVCF, on that problem region, does it still error out?

  • Fer, Austria (Member)

    Hi. It took a while to run, but I didn't want to write again with premature results. When I ran GenotypeGVCFs (without the -nt argument) on each combined GVCF, 4 out of the 6 batches crashed again at approximately the same region. However, when I ran GenotypeGVCFs directly on the individual gVCFs (~190 in each batch), everything went smoothly.
    What could it be?

  • Sheila, Broad Institute (Member, Broadie admin)

    @Fer‌

    Hi,

    We have had some users complain of issues with CombineGVCFs. There may be an indel in the region that is causing the crash. Can you please check, and let us know if there is indeed an indel?

    Thanks,
    Sheila

  • Fer, Austria (Member)
    edited December 2014

    I don't see anything in common around the last reported variant in the output of GenotypeGVCFs. It is always the same region, though not exactly the same position. In only 1 out of the 4 crashes is this last position an INDEL; however, I guess that last position might be all right, since it made it to the VCF.

    Going a step back to the output of CombineGVCFs that later caused the problems, which is what I believe @Sheila asked me to check, I can see that the next variant, the one that didn't make it to the VCF, was also not an INDEL. It was always a T -> C though, but I don't see what might be wrong with it.

    Here I report the context of the last variant reported by GenotypeGVCFs in its input Combined gVCF for 4 different batches. Positions 4279274, 4279117, 4279186 and 4279147, respectively:

    Chr2 4279274 . T C, . . DP=10091;MQ=58.31;MQ0=0 GT:AD:DP:MIN_DP:PL:SB
    Chr2 4279275 . T C, . . BaseQRankSum=1.82;ClippingRankSum=-2.100e-01;DP=10059;MQ=57.71;MQ0=0;MQRankSum=-2.275e+00;ReadPosRankSum=-6.600e-01 GT:AD:DP:MIN_DP:PGT:PID:PL:SB
    Chr2 4279276 . A . . . GT:DP:GQ:MIN_DP:PL

    Chr2 4279117 . C T, . . BaseQRankSum=-1.745e+00;ClippingRankSum=0.237;DP=11509;MQ=59.96;MQ0=0;MQRankSum=-6.000e-03;ReadPosRankSum=0.274 GT:AD:DP:MIN_DP:PGT:PID:PL:SB
    Chr2 4279118 . T . . . GT:DP:GQ:MIN_DP:PL
    Chr2 4279119 . T C, . . BaseQRankSum=-1.061e+00;ClippingRankSum=0.476;DP=11589;MQ=60.00;MQ0=0;MQRankSum=0.476;ReadPosRankSum=-8.420e-01 GT:AD:DP:MIN_DP:PL:SB

    Chr2 4279186 . C T, . . BaseQRankSum=-7.720e-01;ClippingRankSum=0.282;DP=10784;MQ=56.23;MQ0=0;MQRankSum=-3.150e+00;ReadPosRankSum=0.919 GT:AD:DP:MIN_DP:PL:SB
    Chr2 4279187 . T C,G, . . BaseQRankSum=-7.800e-02;ClippingRankSum=-3.120e-01;DP=10732;MQ=60.00;MQ0=0;MQRankSum=-2.600e-01;ReadPosRankSum=0.358 GT:AD:DP:MIN_DP:PGT:PID:PL:SB
    Chr2 4279188 . T G, . . BaseQRankSum=-1.189e+00;ClippingRankSum=0.405;DP=10771;MQ=56.23;MQ0=0;MQRankSum=-3.616e+00;ReadPosRankSum=0.601 GT:AD:DP:MIN_DP:PL:SB

    Chr2 4279147 . ATCTCTTGTGCATCCAGTTTGAACTTCTCAAT A, . . BaseQRankSum=-2.704e+00;ClippingRankSum=0.724;DP=11685;MQ=59.87;MQ0=0;MQRankSum=-1.420e-01;ReadPosRankSum=-1.600e-02 GT:AD:DP:MIN_DP:PGT:PID:PL:SB
    Chr2 4279148 . T C, . . BaseQRankSum=-1.733e+00;ClippingRankSum=-1.910e-01;DP=11370;MQ=60.00;MQ0=0;MQRankSum=-1.790e-01;ReadPosRankSum=0.606 GT:AD:DP:MIN_DP:PGT:PID:PL:SB
    Chr2 4279149 . C T, . . BaseQRankSum=-1.433e+00;ClippingRankSum=0.620;DP=11724;MQ=60.00;MQ0=0;MQRankSum=0.075;ReadPosRankSum=-1.700e-02 GT:AD:DP:MIN_DP:PGT:PID:PL:SB

  • Sheila, Broad Institute (Member, Broadie admin)

    @Fer‌

    Hello,

    If you can, please upload some snippets of your files so we can debug them locally. Instructions for how to do so are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • Fer, Austria (Member)
    edited December 2014

    Hi,
    I've uploaded a snippet of the combined gVCF that later made GenotypeGVCFs fail. The error reproduces with it, and the command can be found in command_GenotypeGVCFs_Fer.sh

    File name: ArrayIndexOutOfBoundsException_GenotypeGVCFs_Fer.tgz

    Cheers and merry Christmas to the GATK team!

  • Sheila, Broad Institute (Member, Broadie admin)

    @Fer‌

    Hi,

    Thanks. We will look into this and get back to you when it is fixed. As I mentioned in my other response to your bug for ploidy limitation, this may take some extra time, as it is the holiday season.

    Merry Christmas!

    -Sheila

  • Fer, Austria (Member)

    Dear GATK team, sorry for the reminder, but is there any news on this matter?

  • Sheila, Broad Institute (Member, Broadie admin)

    @Fer‌

    Hi,

    Thanks for the reminder. This and your other bug report fell off my radar. I just submitted a bug report for this issue and will let you know when it is fixed. It looks like an issue with the high number of alternate alleles at one site.

    -Sheila
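
For anyone who wants to check their own combined gVCF for this kind of site, here is a minimal Java sketch that counts the ALT alleles in each record and flags records where a diploid PL field would need more entries than a given limit. The class and method names (`AltAlleleScanSketch`, `altCount`, `plEntries`, `describeIfExceeding`, `sitesExceeding`) are made up for illustration and are not part of GATK or htsjdk. The sketch assumes uncompressed, tab-separated VCF text, and the limit of 10000 is an assumption borrowed from the decode buffer size quoted further down in this thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper only; not part of GATK or htsjdk.
class AltAlleleScanSketch {

    // Count ALT alleles (column 5) in one tab-separated VCF data line.
    static int altCount(String vcfLine) {
        String[] fields = vcfLine.split("\t");
        if (fields.length < 5 || fields[4].equals(".")) {
            return 0;
        }
        return fields[4].split(",").length;
    }

    // Diploid genotypes (PL entries per sample) for REF plus nAlt ALT alleles.
    static long plEntries(int nAlt) {
        long nAlleles = nAlt + 1L;
        return nAlleles * (nAlleles + 1) / 2;
    }

    // Return "CHROM:POS (n ALT alleles)" if the record needs more than maxPl
    // PL entries, or null for header lines and records within the limit.
    static String describeIfExceeding(String line, long maxPl) {
        if (line.isEmpty() || line.startsWith("#")) {
            return null;
        }
        int nAlt = altCount(line);
        if (plEntries(nAlt) <= maxPl) {
            return null;
        }
        String[] f = line.split("\t", 3);
        return f[0] + ":" + f[1] + " (" + nAlt + " ALT alleles)";
    }

    // Collect descriptions of all records that exceed the limit.
    static List<String> sitesExceeding(Iterable<String> lines, long maxPl) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            String hit = describeIfExceeding(line, maxPl);
            if (hit != null) {
                hits.add(hit);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java AltAlleleScanSketch.java <combined.g.vcf>");
            return;
        }
        // Stream line by line, since combined gVCFs can be very large.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String hit = describeIfExceeding(line, 10000);  // assumed limit
                if (hit != null) {
                    System.out.println(hit);
                }
            }
        }
    }
}
```

Running it on a snippet around the crash region should be enough to see whether a single record with a very long ALT list sits just after the last position that made it into the output VCF.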

  • dtaliun (Member)

    Hi,

    Do you have any updates about this bug? I ran into exactly the same problem on v3.3.0 without using -nt.

    Thanks,
    Daniel

  • Sheila, Broad Institute (Member, Broadie admin)

    @dtaliun
    Hi Daniel,

    Unfortunately, our developers have not gotten to fixing this yet. They have been busy with other higher priorities. I will post when it is fixed.

    -Sheila

  • dtaliun (Member)

    Hi Sheila,

    This bug is due to the large number (hundreds) of alternative alleles at the processed loci. Assuming there are N alleles in a diploid organism, the VCF file stores N * (N + 1) / 2 likelihoods per sample in the PL field. However, the samtools:htsjdk:1.120.1620 library for VCF parsing that is used in GATK 3.3.0 reads only the first 10,000 likelihoods:

    Lines 755-766 in htsjdk.variant.vcf.AbstractVCFCodec.java:

    private final String[] INT_DECODE_ARRAY = new String[10000];
    private final int[] decodeInts(final String string) {
        final int nValues = ParsingUtils.split(string, INT_DECODE_ARRAY, ',');
        final int[] values = new int[nValues];
        try {
            for ( int i = 0; i < nValues; i++ )
                values[i] = Integer.valueOf(INT_DECODE_ARRAY[i]);
        } catch (final NumberFormatException e) {
            return null;
        }
        return values;
    }

    Since GATK assumes that there are N * (N + 1) / 2 likelihoods, it eventually indexes past the end of the 10,000-element array.
    The latest 1.130 version of samtools:htsjdk still has 10000 likelihoods limitation.

    Best,
    Daniel
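
To put numbers on the explanation above: for a diploid sample, a site with N alleles (counting the reference and, in a gVCF, the symbolic <NON_REF> allele) has N * (N + 1) / 2 unordered genotypes, so that is how many PL values each sample carries. The Java sketch below computes that count and the largest allele count that still fits within a given capacity. `PlCountSketch` and its methods are hypothetical names for illustration, not GATK or htsjdk code, and the capacity of 10,000 is taken from the snippet quoted above.

```java
// Illustrative arithmetic only; not part of GATK or htsjdk.
class PlCountSketch {

    // Number of unordered diploid genotypes for nAlleles alleles (REF included).
    static long genotypeCount(int nAlleles) {
        return (long) nAlleles * (nAlleles + 1) / 2;
    }

    // Largest allele count whose PL field still fits in `capacity` entries.
    static int maxAllelesWithin(long capacity) {
        int n = 0;
        while (genotypeCount(n + 1) <= capacity) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        long capacity = 10000;  // assumed buffer size, from the quoted snippet
        int limit = maxAllelesWithin(capacity);
        System.out.println("alleles that fit: " + limit
                + " (" + genotypeCount(limit) + " PL values)");
        System.out.println("first overflow:   " + (limit + 1)
                + " (" + genotypeCount(limit + 1) + " PL values)");
    }
}
```

Under these assumptions, 140 alleles need 9,870 PL values and still fit, while 141 alleles need 10,011 and do not.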

    Issue · Github
    by Geraldine_VdAuwera

    Issue Number: 937
    State: closed
    Closed By: vdauwera

  • Sheila, Broad Institute (Member, Broadie admin)

    @dtaliun
    Hi Daniel,

    Wow. Thank you for looking into this. I will bring this up at the next meeting.

    -Sheila

  • Kurt (Member ✭✭✭)

    Hi,

    I've seen this error now with GATK 3.3-0 using 2050 exomes. HC in ERC GVCF mode was used. All 2050 samples were combined with CombineGVCFs into 1 big GVCF file per chromosome. No threading was used. Chromosomes 1,3,4,5,8,12,13,16 and 20 all crashed.

    I did not see it for 787 different samples processed. Although those were processed with a different exome capture and the January 15th GATK nightly did the CombineGVCFs and GenotypeGVCFs stuff (the single sample GVCFs were created with 3.3-0).

    Kurt

  • Fer, Austria (Member)

    Thanks @dtaliun for spotting the problem, and thanks @Geraldine_VdAuwera for pointing at the problematic alleles. Not surprisingly it's a TE.
    Would you post here (or can I follow somewhere) when you upload the nightly version incorporating the solution to this issue?
    Regards.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    @Fer Because this involves a change to a library dependency (htsjdk), the fix won't be available in the nightly builds; it will only be in the next official release, version 3.4. The good news is that we are planning to release 3.4 very soon (probably next week) and we'll announce that widely.

  • hns04 (Member)
    edited September 2015

    @Geraldine_VdAuwera Any update on this? I am running UG on a single sample with -nt 12, but I am getting the java.lang.ArrayIndexOutOfBoundsException: 200 error. We are using GATK version 3.3. Is there any workaround for this, or would we need to update to 3.4? Also, I am able to produce the allbases file without a problem.

    Thanks,
    Himanshu

  • Sheila, Broad Institute (Member, Broadie admin)

    @hns04
    Hi,

    We are no longer supporting Unified Genotyper. We recommend Haplotype Caller for variant calling.

    -Sheila

  • hns04 (Member)

    @Sheila Thanks for your reply. It seems as though this issue occurs with both UG and HC. I just wanted to know: if I switch over to 3.4, is the java.lang.ArrayIndexOutOfBoundsException: 200 issue fixed?

    Thanks a lot for your prompt reply. Really appreciate all the help.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    ArrayIndexOutOfBoundsException problems can pop up for different reasons. The bug underlying the instance reported in this thread is fixed and no longer occurs in 3.4. If the problem you are experiencing was caused by that bug (which seems likely), your problem will be solved by upgrading.
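
For readers trying to tell such cases apart: the number in the exception message (10000 in the original report) is the array index that was out of range, as printed by the JVM used in that run. Below is a minimal Java sketch of how the failure discussed in this thread can arise when a value list is truncated at a fixed capacity and downstream code trusts the allele count rather than the array length. `FixedBufferSketch` and its methods are hypothetical stand-ins, not the htsjdk or GATK implementation, and newer JVMs word the exception message differently.

```java
import java.util.Collections;

// Hypothetical stand-in for a parser with a fixed-size scratch buffer.
// This is NOT the htsjdk implementation, only an illustration.
class FixedBufferSketch {
    static final int CAPACITY = 10000;  // assumed limit, as quoted in this thread

    // Decode a comma-separated list of integers, keeping at most CAPACITY values.
    static int[] decodeTruncating(String csv) {
        String[] tokens = csv.split(",");
        int n = Math.min(tokens.length, CAPACITY);
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = Integer.parseInt(tokens[i]);
        }
        return values;
    }

    // Consumer that trusts the allele count instead of the array length.
    static int sumExpected(int[] pls, int nAlleles) {
        int expected = nAlleles * (nAlleles + 1) / 2;
        int sum = 0;
        for (int i = 0; i < expected; i++) {
            sum += pls[i];  // throws once i reaches pls.length
        }
        return sum;
    }

    public static void main(String[] args) {
        int nAlleles = 141;  // 141 * 142 / 2 = 10011 PL values
        String pl = String.join(",", Collections.nCopies(10011, "0"));
        int[] decoded = decodeTruncating(pl);
        System.out.println("decoded " + decoded.length + " values");
        try {
            sumExpected(decoded, nAlleles);
        } catch (ArrayIndexOutOfBoundsException e) {
            // The message identifies the offending index; exact wording varies by JVM.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The safer pattern in a consumer like this is to compare the expected count with the decoded array length first and fail with a descriptive error, rather than indexing on trust.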
