GATK (v4.0.10.1) CombineGVCFs failing with 'java.lang.OutOfMemoryError'; not using memory provided

zugzug Member ✭✭
edited October 2018 in Ask the GATK team

Hi,

We ran a CombineGVCFs job using the following command, where gvcfs.list contained only 31 GVCF files with 24 samples each:

$GATK --java-options "-Xmx650G" \
CombineGVCFs \
-R $referenceFasta \
-O full_cohort.b37.g.vcf \
--variant gvcfs.list

We tried the extreme memory because CombineGVCFs kept failing. This node has 750G of RAM.

Despite the high memory provided, we get the stack trace below. The total memory reported by GATK is only ~12G, though (Runtime.totalMemory()=12662603776). Am I missing something? I don't understand why GATK uses only 12G of RAM when we provided much more, and then fails with an OutOfMemoryError.

We are currently setting up GenomicsDBImport, but this seems worth reporting.

Really appreciate your help.

18:55:51.944 INFO ProgressMeter - 4:26649295 23.6 18617000 787894.4
18:56:01.988 INFO ProgressMeter - 4:26655758 23.8 18779000 789159.6
18:59:13.407 INFO CombineGVCFs - Shutting down engine
[October 19, 2018 6:59:13 PM CDT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 27.06 minutes.
Runtime.totalMemory()=12662603776
Exception in thread "main" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:316)
at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
at java.io.BufferedWriter.close(BufferedWriter.java:266)
at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:226)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.closeTool(CombineGVCFs.java:461)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:970)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)



Answers

  • zugzug Member ✭✭

    As an update, it looks like GenomicsDBImport only supports diploid data, so we cannot use it. We would really appreciate your help on this.

    Issue · GitHub: broadinstitute/gatk#5383, filed by shlee, now closed (by sooheelee).
  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi @zugzug,

    Here I'm taking a stab in the dark, but perhaps improving resource management by helping garbage collection along will get your CombineGVCFs command to succeed. Garbage-collection behavior can be tuned with JVM flags passed through --java-options.
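
    For example, a minimal hedged sketch of passing garbage-collection flags alongside the heap setting (the flag choices and values below are illustrative, not a tuned recommendation):

        # Illustrative only: request the parallel collector with a modest
        # thread count in addition to the heap cap.
        $GATK --java-options "-Xmx650G -XX:+UseParallelGC -XX:ParallelGCThreads=4" \
            CombineGVCFs \
            -R $referenceFasta \
            -O full_cohort.b37.g.vcf \
            --variant gvcfs.list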

    If this doesn't help, I'll need to consult with the developers. Towards this, it would be great if you could explain a bit about your setup, e.g., are you using WDL, Docker, a server, etc. Thanks.

  • zugzug Member ✭✭
    Accepted Answer

    @shlee,

    We finally figured out the problem. It's hard to explain why we're doing this (paper coming), but the issue turned out to be that CombineGVCFs does not seem to like combining GVCFs of different ploidies (same organism).

    Not sure why that resulted in a memory error, though.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Thanks for reporting back that you've worked around the issue, @zugzug. It would be great to confirm whether the actual cause of the error is ploidy or memory. If the cause is ploidy, we can ask the developers to improve how the tool handles such inputs.

    I've not tried CombineGVCFs on multi-ploidy data, but I can tell you that GenotypeGVCFs is able to handle multiple ploidies. I believe what I tested was HaplotypeCaller --> single-sample GVCFs --> GenotypeGVCFs.
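
    A hedged sketch of that pipeline (file names and the ploidy value are placeholders; --sample-ploidy is HaplotypeCaller's argument for non-diploid samples):

        # Call one non-diploid sample in GVCF mode, then genotype it directly.
        $GATK HaplotypeCaller \
            -R reference.fasta \
            -I sample1.bam \
            --sample-ploidy 4 \
            -ERC GVCF \
            -O sample1.g.vcf
        $GATK GenotypeGVCFs \
            -R reference.fasta \
            -V sample1.g.vcf \
            -O sample1.vcf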

    Are you able to move forward in your research?

  • zugzug Member ✭✭

    Yes, we seem to be in business treating each ploidy set separately.
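
    As a hedged illustration of this workaround (the per-ploidy list files are hypothetical), combining and genotyping each ploidy group on its own might look like:

        # Each list contains only GVCFs of a single ploidy.
        $GATK CombineGVCFs -R $referenceFasta --variant gvcfs.ploidy2.list -O cohort.ploidy2.g.vcf
        $GATK CombineGVCFs -R $referenceFasta --variant gvcfs.ploidy4.list -O cohort.ploidy4.g.vcf
        $GATK GenotypeGVCFs -R $referenceFasta -V cohort.ploidy2.g.vcf -O cohort.ploidy2.vcf
        $GATK GenotypeGVCFs -R $referenceFasta -V cohort.ploidy4.g.vcf -O cohort.ploidy4.vcf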

  • zugzug Member ✭✭

    Hello @shlee,

    I spoke too soon, unfortunately. We can't seem to get more than 120 samples combined into a single GVCF, and the problem is correlated with higher ploidies.

    At this point, we're genotyping the GVCFs for each set of 120 samples and then combining the resulting VCFs. The problem we're seeing with this method, however, is as follows: sets of 120 where a mutation is not observed are given an undefined genotype (./.), whereas the set of 120 in which the mutation was observed has valid genotypes (e.g., 0/0).
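
    Hypothetical records illustrating the discrepancy (position and genotype values invented): after merging the per-set VCFs, the same site looks like this across two sets:

        # From the set of 120 containing the carrier:
        4  26649295  .  A  G  ...  GT  0/1  0/0  0/0 ...
        # From a set of 120 where the site was never called:
        4  26649295  .  A  G  ...  GT  ./.  ./.  ./. ...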

  • shlee Cambridge Member, Broadie ✭✭✭✭✭
    edited October 2018

    Hi @zugzug,

    Bear with me, as I'm rather new to germline calling questions. Sheila used to handle these, and I have mostly focused on pre-processing and the new somatic workflows so far.

    I think what you are saying is that when locus A across cohort120-X does not have a single sample with the alt allele, then the genotype ./. is given to the samples across locus A. However, if cohort120-Y has a sample with an alt for locus A, then you see valid genotypes of 0/0 and 0/1. Is this what you observe? It really helps us to have illustrative example records. In lieu of this, to help us clarify the issue and get to the root of the problem, would it be possible for you to submit subset test data that recapitulates what you observe? Instructions for submitting data are at https://software.broadinstitute.org/gatk/guide/article?id=1894. Thank you.

    P.S. I am looking into placing a feature request on your behalf for GenomicsDB to accept mixed-ploidy and non-diploid data at https://github.com/broadinstitute/gatk/issues/5383.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi @zugzug,

    Just following up.

    To your earlier update that "looks like GenomicsDBImport only supports diploid data, so we cannot use it":

    Our developer asks if you can test with the latest GenomicsDBImport whether the tool accepts your mixed-ploidy data. GenomicsDBImport should handle non-diploid data.
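
    A hedged sketch of such a test (the GVCF names and the interval are placeholders; the tool requires at least one -L interval and a non-existent or empty workspace directory):

        $GATK GenomicsDBImport \
            -V sample_ploidy2.g.vcf \
            -V sample_ploidy4.g.vcf \
            --genomicsdb-workspace-path mixed_ploidy_db \
            -L 20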

  • zugzug Member ✭✭

    Thank you for following up. Should it also work with mixed ploidies?

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    @zugzug, given GenotypeGVCFs handles any ploidy and a mix of ploidies, it makes sense that GenomicsDBImport should follow suit. If it doesn't, then this is something we will rectify.

  • cmt Seattle WA Member
    Hi @zugzug and @shlee

    I have run into this same problem with my data set- I am trying to combine 12 GVCF files, each with only 1 pooled WGS sample, and all GVCF files have a ploidy of 20.

    I was under the impression that GenomicsDBImport could not handle non-diploid data, so I was trying to use CombineGVCFs to combine all 12 GVCFs into one to use in GenotypeGVCFs. I have hit the same "java.lang.OutOfMemoryError" using CombineGVCFs, no matter how high I set the memory.

    Did you have any success using GenomicsDBImport on your non-diploid data?
  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @cmt,

    Have you followed all of the caveats in the tool documentation? (A sketch applying them follows the list.)

    Caveats
    * IMPORTANT: The -Xmx value the tool is run with should be less than the total amount of physical memory available by at least a few GB, as the native TileDB library requires additional memory on top of the Java memory. Failure to leave enough memory for the native code can result in confusing error messages!
    * At least one interval must be provided
    * Input GVCFs cannot contain multiple entries for a single genomic position
    * The --genomicsdb-workspace-path must point to a non-existent or empty directory.
    * GenomicsDBImport uses temporary disk storage during import. The amount of temporary disk storage required can exceed the space available, especially when specifying a large number of intervals. The command line argument `--tmp-dir` can be used to specify an alternate temporary storage location with sufficient space.
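
    A hedged sketch applying these caveats on, say, a 32 GB machine (paths and the interval are placeholders): -Xmx leaves a few GB of headroom for the native TileDB library, an interval is provided, the workspace directory does not yet exist, and --tmp-dir points at roomy storage.

        $GATK --java-options "-Xmx26G" \
            GenomicsDBImport \
            -V pool1.g.vcf \
            -V pool2.g.vcf \
            --genomicsdb-workspace-path new_workspace \
            --tmp-dir /path/to/big/tmp \
            -L LG03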
    
    

    If you have and it still is not working, can you please provide the error logs?

  • cmt Seattle WA Member

    Hi @AdelaideR, sorry, I missed your message, thank you for your response!

    I think that I have followed the caveats, as far as I can tell, but I do get confusing error messages...

    I have tried a couple of different methods for joint genotyping. My ideal method was to process all 12 of my pools at once: GenomicsDBImport, then GenotypeGVCFs, then GatherVcfs, using a list of 24 intervals (one per linkage group) that the workflow scatters over before combining all the VCFs into one at the end. The GenomicsDBImport step works, building my database with no issues, but genotyping fails. GenotypeGVCFs takes a long time: every linkage group uses up all 3 preemptible attempts, starts a fourth time, and then fails with this error:

    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.33612df3 
    [I cut out all the warnings ]
    java: /home/vagrant/GenomicsDB/dependencies/htslib/vcf.c:3641: bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed.
    Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx20g -Xms20g -jar /gatk/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R /cromwell_root/uw-hauser-pcod-gatk-tests/gadMor2.fasta -O Genotyped.vcf.gz -G AS_StandardAnnotation -G StandardHCAnnotation --disable-read-filter NotDuplicateReadFilter -V gendb://genomicsdb_LGs --verbosity ERROR -L LG03:1-29451055
    

    This is the WDL task I'm running:

    task GenotypeGVCFs {
      File workspace_tar
      String interval
      File ref_fasta
      File ref_fasta_index
      File ref_dict
      String output_vcf_filename
      Int disk_size
      Int preemptible
    
      command <<<
        set -e
    
        # Unpack the GenomicsDB workspace tarball and recover its directory name
        tar -xf ${workspace_tar}
        WORKSPACE=$( basename ${workspace_tar} .tar)
    
        "/gatk/gatk" --java-options "-Xmx20g -Xms20g" \
         GenotypeGVCFs \
         -R ${ref_fasta} \
         -O ${output_vcf_filename} \
         -G AS_StandardAnnotation \
         -G StandardHCAnnotation \
         --disable-read-filter NotDuplicateReadFilter \
         -V gendb://$WORKSPACE \
         --verbosity ERROR \
         -L ${interval}
    
      >>>
      runtime {
        docker: "broadinstitute/gatk:4.1.0.0"
        memory: "32 GB"
        cpu: "2"
        disks: "local-disk " + disk_size + " HDD"
        preemptible: preemptible
      }
      output {
        File output_vcf = "${output_vcf_filename}"
        File output_vcf_index = "${output_vcf_filename}.tbi"
      }
    }
    
    

    I thought maybe it was a memory issue, so I increased the memory for everything, but that just made the job run longer before failing, and it cost a lot. I googled the error "bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed" and got only one hit, on a forum thread about BCFtools. It made some sense (something about excess PLs, a field in the GVCF that I have had to troubleshoot before), but I didn't see any solutions or any indication that I was having the same problem, so I kept moving ahead.

    So I have now broken up the WDL so that it can be run on all 12 pools, but one interval at a time, so I can see where the issues arise. I'm still having trouble, but it might just be operator error as I work on getting the code to run.

    Do you know what the error "bcf_update_format: Assertion `nps && nps*line->n_sample==n' failed." might mean?

    Thanks!

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    Hi @cmt

    This might be a FireCloud/Cromwell question. @SChaluvadi will be able to help you out with this. I am moving this question over to the FireCloud team.

  • cmtcmt Seattle WAMember

    Hi @bhanuGandham and @SChaluvadi,

    I'm no longer having this problem, but I am still having some issues. I'll move over to the FireCloud/Cromwell forum and ask a new question there!

    Thanks!
