Spark-related heap error in GATK4 PathSeqBuildKmers

Hi all,

I am encountering a Java heap space error when trying to generate the host k-mer library from the PathSeq resource bundle, and I am at a bit of a loss to understand and troubleshoot it. The error appears to occur after the tool has actually completed its run:

org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqBuildKmers done. Elapsed time: 12.53 minutes.
Runtime.totalMemory()=68719476736
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.broadinstitute.hellbender.tools.spark.utils.LongHopscotchSet.<init>(LongHopscotchSet.java:59)
        at org.broadinstitute.hellbender.tools.spark.utils.LargeLongHopscotchSet.<init>(LargeLongHopscotchSet.java:42)
        at org.broadinstitute.hellbender.tools.spark.pathseq.PSKmerUtils.longArrayCollectionToSet(PSKmerUtils.java:82)
        at org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqBuildKmers.doWork(PathSeqBuildKmers.java:171)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

However, no host.hss file is generated. I am invoking the program with what should be more than enough heap space:

    ./gatk/gatk --java-options "-Xms72G -Xmx72G" PathSeqBuildKmers --reference pathseq_host.fa -O host.hss

I've watched top during a run and can confirm that the process never exceeds around 69GB of memory usage. I've tried playing with the Spark options in case the heap space issue is occurring there, but setting "--spark-master local[*]" throws "A USER ERROR has occurred: spark-master is not a recognized option" whether or not I include --spark-runner LOCAL, so I'm not sure how I'm supposed to configure Spark given that issue. My GATK version is 4.0.5.1-local, and I'm using OpenJDK 1.8.0_131 as the JVM. Thanks for any help you can provide, and please let me know if you need any additional information from my end.

Hollis Wright, PhD
Assistant Staff Scientist
Oregon Health And Science University

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Tagging @markw, who should be able to help you out with this.

  • markw (Cambridge, MA; Member, Broadie, Moderator, Dev)

    Hello @wrighth_ohsu

    PathSeqBuildKmers is not a Spark tool, which is why the Spark options are not recognized.
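
    For the tools that do use Spark (names ending in "Spark"), the Spark arguments are passed after a "--" separator at the end of the command line. A quick sketch of the general form (the tool arguments are placeholders, and local[*] just means "use all local cores"):

        ./gatk/gatk PathSeqPipelineSpark [tool arguments] \
            -- --spark-runner LOCAL --spark-master local[*]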

    This particular tool is memory-heavy, requiring at least 2 * (8 bytes) * (reference length in Gbp) GB of heap (~60 GB for pathseq_host.fa). There may be additional JVM overhead that is causing it to run out even with 72 GB. Can you try with additional memory?
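
    As a rough back-of-the-envelope check, you can estimate the minimum heap from the reference size before rerunning (a sketch: the byte count below treats every non-header character in the FASTA as a base, and the 120G heap is just an illustrative value, not an exact requirement):

        # Estimate the minimum heap (~2 * 8 bytes per reference base).
        REF_BASES=$(grep -v '^>' pathseq_host.fa | tr -d '\n' | wc -c)
        echo "Estimated minimum heap (GB): $(echo "2 * 8 * $REF_BASES / 1024^3" | bc)"

        # Then retry with a heap comfortably above that estimate, e.g.:
        ./gatk/gatk --java-options "-Xms120G -Xmx120G" PathSeqBuildKmers \
            --reference pathseq_host.fa -O host.hss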

  • Hi @markw,

    I was confused about the Spark issue because the current version of the documentation seems to imply that all tools use Spark unless they have explicit Spark/non-Spark versions:

    https://software.broadinstitute.org/gatk/documentation/article?id=11245

    and of course the org.broadinstitute.hellbender.tools.spark package names suggest the same, so thank you for clearing that up. Regarding running with more memory, I did try that, but I get memory allocation errors as soon as I go past 72 GB, before PathSeqBuildKmers even runs. I can't think of any obvious reason that should happen on a node with 200 GB available. Any ideas? If not, I can talk with my sysadmins; there may be something wrong with the environment, and/or I'm not actually getting my full 200 GB allocation for some reason.

  • markw (Cambridge, MA; Member, Broadie, Moderator, Dev)

    Hello @wrighth_ohsu

    I see how that document is confusing. It is saying that tools ending in "Spark" always use Spark but that not all tools that use Spark end in "Spark." What it doesn't explicitly say is that non-Spark tools also do not have "Spark" in the name, which is the case here. You're right, though, that it probably doesn't belong in the Spark tools package :blush:

    If, for example, -Xmx180g crashes, then you will need to contact your sysadmins (they may have a limit on allowed memory usage per user or process).
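
    A few quick things worth checking on the node first (a sketch assuming a Linux shell; job schedulers such as SLURM can also cap memory per job, which these commands won't show):

        ulimit -v    # per-process virtual memory limit, in KB ("unlimited" means no shell cap)
        ulimit -m    # per-process resident set size limit
        free -g      # total and available memory on the node, in GB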
