GATK (v4.0.10.1) CombineGVCFs failing with 'java.lang.OutOfMemoryError'; not using memory provided

zugzug Member ✭✭
edited October 2018 in Ask the GATK team

Hi,

We ran a CombineGVCFs job using the following command, where gvcfs.list contained only 31 GVCF files with 24 samples each:

$GATK --java-options "-Xmx650G" \
CombineGVCFs \
-R $referenceFasta \
-O full_cohort.b37.g.vcf \
--variant gvcfs.list

We tried the extreme memory because CombineGVCFs kept failing. This node has 750G of RAM.

Despite the high memory provided, we get the stacktrace below. The total memory reported by GATK is only ~12G, though (Runtime.totalMemory()=12662603776). Am I missing something? I don't understand why GATK is only using 12G of RAM when we provided much more, and then failing with an OutOfMemoryError.
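
A note on the number in the log: `Runtime.totalMemory()` reports the heap the JVM has committed at that moment, while the ceiling set by `-Xmx` is what `Runtime.maxMemory()` reports. A value of about 12G therefore does not by itself show that the `-Xmx650G` setting was ignored. The arithmetic can be sketched as follows (`to_gib` is an illustrative helper, not part of GATK):

```python
GIB = 2 ** 30  # bytes per gibibyte


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


# Value printed by GATK as Runtime.totalMemory() in this post.
total_memory_bytes = 12662603776
# Ceiling requested with -Xmx650G (the JVM reads G as 2**30 bytes).
xmx_bytes = 650 * GIB

committed = to_gib(total_memory_bytes)
print(f"committed heap: {committed:.2f} GiB of a {to_gib(xmx_bytes):.0f} GiB ceiling")
```

The heap in use was far below the requested ceiling when the error was thrown, which suggests the failure was not ordinary heap exhaustion.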

We are currently setting up GenomicsDBImport, but this seems worth reporting.

Really appreciate your help.

18:55:51.944 INFO ProgressMeter - 4:26649295 23.6 18617000 787894.4
18:56:01.988 INFO ProgressMeter - 4:26655758 23.8 18779000 789159.6
18:59:13.407 INFO CombineGVCFs - Shutting down engine
[October 19, 2018 6:59:13 PM CDT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 27.06 minutes.
Runtime.totalMemory()=12662603776
Exception in thread "main" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:316)
at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
at java.io.BufferedWriter.close(BufferedWriter.java:266)
at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:226)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.closeTool(CombineGVCFs.java:461)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:970)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
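
A note on reading this trace, offered as a sketch rather than a confirmed diagnosis: the error is raised in `ByteArrayOutputStream.hugeCapacity`, and it carries no "Java heap space" message. In JDK 8 that method throws `OutOfMemoryError` when the requested buffer size overflows a signed 32-bit `int`, because Java arrays are int-indexed and a single `byte[]` cannot grow past roughly 2 GiB. If that is what happened here, raising `-Xmx` would not help. The Python below is a simplified model of that check; `wrap_int32` and `requested_capacity` are illustrative names, not JDK or GATK functions.

```python
INT32_MAX = 2 ** 31 - 1  # largest value a Java int can hold


def wrap_int32(n):
    """Wrap an integer the way Java's signed 32-bit int arithmetic does."""
    return (n + 2 ** 31) % 2 ** 32 - 2 ** 31


def requested_capacity(current_bytes, incoming_bytes):
    """Model the size check made before a byte buffer grows.

    Returns the requested capacity as a Java int, or raises MemoryError
    when the sum overflows, regardless of how much heap is available.
    """
    min_capacity = wrap_int32(current_bytes + incoming_bytes)
    if min_capacity < 0:
        # Corresponds to the overflow branch that throws OutOfMemoryError.
        raise MemoryError("buffer would exceed the int-indexed array limit")
    return min_capacity


if __name__ == "__main__":
    print(requested_capacity(1000, 24))  # a small write fits
    try:
        requested_capacity(INT32_MAX - 10, 100)  # pushes past 2**31 - 1 bytes
    except MemoryError as err:
        print("failed:", err)
```

In this model the failure depends only on the size of one buffer, not on the heap ceiling, which would be consistent with the error appearing even with 650G available.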

Answers

  • zugzug Member ✭✭

    As an update, looks like GenomicsDBImport only supports diploid data, so we cannot use it. Would really appreciate your help on this.

    GitHub issue 5383 (filed by shlee), state: open
  • shlee Cambridge Member, Broadie, Moderator admin

    Hi @zugzug,

    Here I'm taking a stab in the dark but perhaps improving resource management by helping garbage collection along will get your CombineGVCFs command to succeed. Here are some resources towards setting garbage collection:

    If this doesn't help, I'll need to consult with the developers. Towards this, it would be great if you can explain a bit about your setup, e.g. are you using WDL, Docker, a server etc. Thanks.

  • zugzug Member ✭✭
    Accepted Answer

    @shlee,

    We finally figured out the problem. It's hard to explain the reasoning for why we're doing this (paper coming), but the issue turned out to be that CombineGVCFs does not seem to like combining GVCFs from different ploidies (same organism).

    Not sure why that resulted in a memory error, though.
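
    One possible contributor, stated as an assumption and not a confirmed cause for this thread: per-genotype fields such as PL carry one value per possible genotype, and the number of unordered genotypes for ploidy P and N alleles is C(P + N - 1, P). That count grows quickly with ploidy, so individual records can become very large. A small sketch of the growth (`genotype_count` is an illustrative name):

```python
from math import comb


def genotype_count(ploidy, n_alleles):
    """Number of unordered genotypes: C(ploidy + n_alleles - 1, ploidy)."""
    return comb(ploidy + n_alleles - 1, ploidy)


if __name__ == "__main__":
    # Compare a few ploidies at sites with 2, 4 and 6 alleles.
    for ploidy in (2, 4, 6, 8):
        counts = [genotype_count(ploidy, n) for n in (2, 4, 6)]
        print(f"ploidy {ploidy}: {counts}")
```

    Multiplying such counts by the number of samples in a combined record gives a rough sense of how record size scales with ploidy.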

  • shlee Cambridge Member, Broadie, Moderator admin

    Thanks for reporting back that you've worked around the issue, @zugzug. It would be great to confirm whether the actual cause of the error is ploidy or memory. If the cause is ploidy, we can ask the developers to improve how the tool handles mixed-ploidy input.

    I've not tried CombineGVCFs on mixed-ploidy data, but I can tell you that GenotypeGVCFs is able to handle multiple ploidies. I believe what I tested was HaplotypeCaller --> single-sample GVCFs --> GenotypeGVCFs.

    Are you able to move forward in your research?

  • zugzug Member ✭✭

    Yes, we seem to be in business treating each ploidy set separately.
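
    For readers wanting to split inputs the same way, here is a minimal sketch of grouping samples by ploidy. It assumes ploidy can be read from the number of alleles in each sample's GT string; the function names and sample names are hypothetical, and this is not how GATK itself determines ploidy.

```python
from collections import defaultdict


def ploidy_of_gt(gt):
    """Count the alleles in a VCF GT string such as '0/1' or '0|0|1|1'."""
    return len(gt.replace("|", "/").split("/"))


def group_by_ploidy(sample_gts):
    """Map ploidy -> sorted sample names, given {sample: GT string}."""
    groups = defaultdict(list)
    for sample, gt in sample_gts.items():
        groups[ploidy_of_gt(gt)].append(sample)
    return {ploidy: sorted(names) for ploidy, names in groups.items()}


if __name__ == "__main__":
    # Hypothetical samples: two diploid, one tetraploid.
    example = {"s1": "0/0", "s2": "0/0/0/1", "s3": "./."}
    print(group_by_ploidy(example))
```

    Each resulting group could then be listed in its own input file and combined separately.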

  • zugzug Member ✭✭

    Hello @shlee,

    I spoke too soon, unfortunately. We can't seem to get more than 120 samples combined into a single GVCF, and the problem is correlated with higher ploidies.

    At this point, we're genotyping the GVCFs for each set of 120 samples and then combining the resulting VCFs. The problem we're seeing with this method, however, is as follows: sets of 120 where a mutation is not observed are given an undefined genotype (./.), whereas the set of 120 in which the mutation was observed has valid genotypes (e.g., 0/0).
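
    A toy model of how this pattern can arise, under two assumptions: that each batch's genotyped VCF contains a record only at sites where that batch has a variant, and that the merge step fills in a missing genotype for samples whose batch has no record at a site. The names `merge_site`, `batch_calls` and `batch_samples` and the site `4:100` are illustrative only.

```python
MISSING = "./."


def merge_site(batch_calls, batch_samples, site):
    """Merge genotypes for one site across separately genotyped batches.

    batch_calls:   {batch: {site: {sample: GT}}}, records present per batch
    batch_samples: {batch: [sample, ...]}
    """
    merged = {}
    for batch, samples in batch_samples.items():
        # None when this batch's VCF has no line for the site.
        record = batch_calls[batch].get(site)
        for sample in samples:
            merged[sample] = record[sample] if record else MISSING
    return merged


if __name__ == "__main__":
    batch_samples = {"A": ["a1", "a2"], "B": ["b1"]}
    # Batch A saw the variant; batch B has no record at this site.
    batch_calls = {"A": {"4:100": {"a1": "0/0", "a2": "0/1"}}, "B": {}}
    print(merge_site(batch_calls, batch_samples, "4:100"))
```

    In this model the samples in batch B come out as `./.` because the merge cannot tell "homozygous reference" apart from "no data" once the site is absent from that batch's VCF.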

  • shlee Cambridge Member, Broadie, Moderator admin
    edited October 2018

    Hi @zugzug,

    Bear with me as I'm rather new to the germline calling questions. Sheila used to handle these and I have mostly focused on pre-processing and the new somatic workflows so far.

    I think what you are saying is that when locus A across cohort120-X does not have a single sample with the alt allele, then the genotype ./. is given to the samples across locus A. However, if cohort120-Y has a sample with an alt for locus A, then you see valid genotypes of 0/0 and 0/1. Is this what you observe? It really helps us to have illustrative example records. In lieu of this, to help us clarify the issue and get to the root of the problem, would it be possible for you to submit subset test data that recapitulates what you observe? Instructions for submitting data are at https://software.broadinstitute.org/gatk/guide/article?id=1894. Thank you.

    P.S. I am looking into placing a feature request on your behalf for GenomicsDB to accept mixed-ploidy and non-diploid data at https://github.com/broadinstitute/gatk/issues/5383.

  • shlee Cambridge Member, Broadie, Moderator admin

    Hi @zugzug,

    Just following up.

    "As an update, looks like GenomicsDBImport only supports diploid data, so we cannot use it. Would really appreciate your help on this."

    Our developer asks if you can test with the latest GenomicsDBImport whether the tool accepts your mixed-ploidy data. GenomicsDBImport should handle non-diploid data.

  • zugzug Member ✭✭

    Thank you for following up. Should it also work with mixed ploidies?

  • shlee Cambridge Member, Broadie, Moderator admin

    @zugzug, given GenotypeGVCFs handles any ploidy and a mix of ploidies, it makes sense that GenomicsDBImport should follow suit. If it doesn't, then this is something we will rectify.
