GetPileupSummaries runs out of memory

I'm trying to run the cross sample contamination check on my samples, but GetPileupSummaries (4.1.1.0) keeps running out of memory, even when running a single sample on a VM that has >200GB of RAM available.

14:35:16.874 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.1.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:35:17.116 INFO  GetPileupSummaries - ------------------------------------------------------------
14:35:17.117 INFO  GetPileupSummaries - The Genome Analysis Toolkit (GATK) v4.1.1.0
14:35:17.117 INFO  GetPileupSummaries - For support and documentation go to https://software.broadinstitute.org/gatk/
14:35:17.118 INFO  GetPileupSummaries - Executing as [email protected] on Linux v4.15.0-47-generic amd64
14:35:17.118 INFO  GetPileupSummaries - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
14:35:17.118 INFO  GetPileupSummaries - Start Date/Time: April 24, 2019 2:35:16 PM UTC
14:35:17.118 INFO  GetPileupSummaries - ------------------------------------------------------------
14:35:17.119 INFO  GetPileupSummaries - ------------------------------------------------------------
14:35:17.119 INFO  GetPileupSummaries - HTSJDK Version: 2.19.0
14:35:17.119 INFO  GetPileupSummaries - Picard Version: 2.19.0
14:35:17.120 INFO  GetPileupSummaries - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:35:17.120 INFO  GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:35:17.120 INFO  GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:35:17.120 INFO  GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:35:17.120 INFO  GetPileupSummaries - Deflater: IntelDeflater
14:35:17.120 INFO  GetPileupSummaries - Inflater: IntelInflater
14:35:17.121 INFO  GetPileupSummaries - GCS max retries/reopens: 20
14:35:17.121 INFO  GetPileupSummaries - Requester pays: disabled
14:35:17.121 WARN  GetPileupSummaries -

   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

   Warning: GetPileupSummaries is a BETA tool and is not yet ready for use in production

   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


14:35:17.121 INFO  GetPileupSummaries - Initializing engine
14:35:17.456 INFO  FeatureManager - Using codec VCFCodec to read file file:///gatk/data/gnomad/vcf/genomes/liftover_grch38/gnomad.b38.biallelic_only.concat.sorted.filtered.vcf.gz
14:35:17.586 INFO  FeatureManager - Using codec VCFCodec to read file file:///gatk/data/gnomad/vcf/genomes/liftover_grch38/gnomad.b38.biallelic_only.concat.sorted.filtered.vcf.gz
16:39:08.359 INFO  IntervalArgumentCollection - Processing 236373212 bp from intervals
16:41:01.520 INFO  GetPileupSummaries - Done initializing engine
16:41:01.521 INFO  ProgressMeter - Starting traversal
16:41:01.521 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute
02:44:42.116 INFO  GetPileupSummaries - Shutting down engine
[April 25, 2019 2:44:42 AM UTC] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 729.42 minutes.
Runtime.totalMemory()=23243784192
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3181)
        at java.util.ArrayList.grow(ArrayList.java:265)
        at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:239)
        at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:231)
        at java.util.ArrayList.add(ArrayList.java:462)
        at htsjdk.samtools.BinningIndexContent.getChunksOverlapping(BinningIndexContent.java:131)
        at htsjdk.samtools.CachingBAMFileIndex.getSpanOverlapping(CachingBAMFileIndex.java:75)
        at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
        at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
        at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
        at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
        at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:405)
        at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextIterator(SamReaderQueryingIterator.java:125)
        at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.<init>(SamReaderQueryingIterator.java:66)
        at org.broadinstitute.hellbender.engine.ReadsDataSource.prepareIteratorsForTraversal(ReadsDataSource.java:404)
        at org.broadinstitute.hellbender.engine.ReadsDataSource.iterator(ReadsDataSource.java:330)
        at java.lang.Iterable.spliterator(Iterable.java:101)
        at org.broadinstitute.hellbender.utils.Utils.stream(Utils.java:1098)
        at org.broadinstitute.hellbender.engine.GATKTool.getTransformedReadStream(GATKTool.java:321)
        at org.broadinstitute.hellbender.engine.LocusWalker.traverse(LocusWalker.java:159)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:984)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /gatk/gatk-package-4.1.1.0-local.jar

The samples I'm running it on are hg38-aligned, ~200GB BAM files that have been merged from multiple lanes, and sometimes from two different flowcells. Other than that, there is nothing special about them. I have been able to run the contamination check successfully on other, non-merged samples.

With this particular run, I tried defining --java-options "-Xmx30G" for the GetPileupSummaries process.

Issue · GitHub (by bhanuGandham)
Issue Number: 5918 · State: open
Answers

  • registered_user (Member)
    edited April 25

    I'm running the process in a docker container. I'm running the command like this:

    sudo docker run --name=${container_name} ${gatk_mounts} --cpus 4 --log-driver=json-file -td broadinstitute/gatk;
    
    sudo docker exec ${container_name} gatk GetPileupSummaries --java-options "-Xmx30G" -I ${input_bam_file} -V ${biall_file} -L ${biall_file} -O ${output_pu_file};
    
  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator, admin)
    edited April 29

    Hi @registered_user

    Can you try running with the default memory options instead of setting a maximum heap size with -Xmx30G?

  • Yes, I tried it with the default memory options the first time; same problem.

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator, admin)

    @registered_user

    How much physical memory do you have on the VM?

    You should leave at least 2-3GB of free physical memory in addition to the heap memory. GATK4 uses some native libraries, and these require memory on top of the Java heap controlled by -Xmx, so it's usually necessary to make sure the machine has a few GB of physical memory available beyond the -Xmx value.
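A rough way to compute a heap size that leaves such headroom is sketched below; the 4 GB headroom figure is just an illustrative assumption, not an official recommendation.

```shell
# Rough sketch: pick a Java heap size that leaves headroom for GATK's
# native libraries. The 4096 MB headroom is an assumed figure, not an
# official recommendation.
total_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
headroom_mb=4096
heap_mb=$(( total_mb - headroom_mb ))
[ "$heap_mb" -lt 1024 ] && heap_mb=1024   # floor for small machines
echo "--java-options \"-Xmx${heap_mb}m\""
```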

  • ...even when running a single sample on a VM that has >200GB of RAM.
  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator, admin)
    edited April 29

    @registered_user

    You already mentioned the RAM you are using. Sorry about that, I missed that information. I will check with the dev team and get back to you shortly.

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator, admin)

    Hi @registered_user

    Looks like the GetPileupSummaries tool needs a lot of memory. I have created a GitHub issue to see how we can optimize that. You can follow the progress here: https://github.com/broadinstitute/gatk/issues/5918
    In the meantime, can you please try setting -Xmx150G, run again, and see if that resolves it?

  • registered_user (Member)

    OK, I'll try running GetPileupSummaries with "-Xmx150G". There also seems to be an issue with my VM's system drive filling up during the run. Does GetPileupSummaries write a lot of temporary files? I'm wondering whether I could specify a working directory for these files when running the algorithm inside a docker container.

  • registered_user (Member)
    edited May 15

    After running the GetPileupSummaries process overnight, it is now consuming 154GB of memory and the output log shows no real progress. Looking at the log timestamps, nothing seems to have happened in more than 12 hours. Even if this run would eventually finish, I don't think running the contamination check on all of my samples is feasible with this kind of memory and time consumption. The system hard drive did not fill up during the run.

    14:28:39.395 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.1.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    14:28:39.972 INFO  GetPileupSummaries - ------------------------------------------------------------
    14:28:39.973 INFO  GetPileupSummaries - The Genome Analysis Toolkit (GATK) v4.1.1.0
    14:28:39.973 INFO  GetPileupSummaries - For support and documentation go to https://software.broadinstitute.org/gatk/
    14:28:39.973 INFO  GetPileupSummaries - Executing as [email protected] on Linux v4.15.0-47-generic amd64
    14:28:39.974 INFO  GetPileupSummaries - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
    14:28:39.974 INFO  GetPileupSummaries - Start Date/Time: May 14, 2019 2:28:39 PM UTC
    14:28:39.974 INFO  GetPileupSummaries - ------------------------------------------------------------
    14:28:39.974 INFO  GetPileupSummaries - ------------------------------------------------------------
    14:28:39.975 INFO  GetPileupSummaries - HTSJDK Version: 2.19.0
    14:28:39.975 INFO  GetPileupSummaries - Picard Version: 2.19.0
    14:28:39.975 INFO  GetPileupSummaries - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    14:28:39.976 INFO  GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    14:28:39.976 INFO  GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    14:28:39.976 INFO  GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    14:28:39.976 INFO  GetPileupSummaries - Deflater: IntelDeflater
    14:28:39.976 INFO  GetPileupSummaries - Inflater: IntelInflater
    14:28:39.976 INFO  GetPileupSummaries - GCS max retries/reopens: 20
    14:28:39.976 INFO  GetPileupSummaries - Requester pays: disabled
    14:28:39.977 WARN  GetPileupSummaries -
    
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    
       Warning: GetPileupSummaries is a BETA tool and is not yet ready for use in production
    
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    
    
    14:28:39.977 INFO  GetPileupSummaries - Initializing engine
    14:28:40.387 INFO  FeatureManager - Using codec VCFCodec to read file file:///gatk/data/gnomad/vcf/genomes/liftover_grch38/gnomad.b38.biallelic_only.concat.sorted.filtered.vcf.gz
    14:28:40.530 INFO  FeatureManager - Using codec VCFCodec to read file file:///gatk/data/gnomad/vcf/genomes/liftover_grch38/gnomad.b38.biallelic_only.concat.sorted.filtered.vcf.gz
    16:47:49.197 INFO  IntervalArgumentCollection - Processing 236373212 bp from intervals
    16:47:53.925 INFO  GetPileupSummaries - Done initializing engine
    16:47:53.926 INFO  ProgressMeter - Starting traversal
    16:47:53.930 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute
    
    
  • registered_user (Member)
    edited May 15

    @bhanuGandham If you know where I could find an hg38 version of biallelic gnomAD data, I could try using that. That said, I think the file I'm using for this run is the same one I lifted over myself and have used successfully for previous cross-sample contamination checks.

  • AdelaideR (Member, admin)

    Hi @registered_user

    The gnomAD data can be found here. If they don't have the specific version, I would encourage you to contact them directly.

    As for the GetPileupSummaries question, it appears that Bhanu has opened an issue about this on our GitHub; you can track it here.

    For the docker output question, you can mount a host directory into the container with the "-v" option of docker run. More documentation about docker commands can be found here.
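A sketch of how those pieces could fit together is below. Host paths and file names are hypothetical, and GATK4 also accepts a --tmp-dir argument that can point at the mounted directory; the command is printed rather than executed here.

```shell
# Hypothetical paths throughout; the command string is printed, not run.
# Mount a host scratch directory into the container and point GATK's
# --tmp-dir at the mount so temporary files land on the large disk.
scratch=/mnt/scratch
cmd="docker run -v ${scratch}:/scratch broadinstitute/gatk \
  gatk GetPileupSummaries --java-options \"-Xmx30G\" \
  --tmp-dir /scratch \
  -I /data/input.bam -V /data/biallelic.vcf.gz -L /data/biallelic.vcf.gz \
  -O /data/pileups.table"
echo "$cmd"
```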

  • registered_user (Member)

    If I'm not mistaken, the gnomAD data you are linking to is in the GRCh37 reference genome format, and that is exactly the problem. When I do the liftover myself, I have no way of verifying that GetPileupSummaries isn't failing because of some difference in the format of the lifted file.

    With docker I am already using "-v" to mount some drives, but I have no way of verifying whether and where GetPileupSummaries writes any temporary files. I don't think this is relevant to the memory issues I am having, though.

  • AdelaideR (Member, admin)

    I believe the VEP folks have a cross-mapped version that is downloadable here.

    Does this work?

  • atari, Switzerland (Member)

    It seems that I am facing a very similar problem.
    GATK 4.1.2.0, genome hg38 (without alternative contigs).
    I have a very simple BAM containing a single sample from a tumor cell line.
    I tried to run GetPileupSummaries using 64G of RAM and the biallelic version of gnomAD (obtained by applying SelectVariants to your bundled hg38 gnomAD file).
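Roughly, such a biallelic subset can be produced from the bundled file along these lines; the file names are illustrative, and the command is printed rather than executed here.

```shell
# Illustrative file names; the command string is printed, not run.
# --restrict-alleles-to BIALLELIC keeps only biallelic sites.
cmd="gatk SelectVariants \
  -V af-only-gnomad.hg38.vcf.gz \
  --restrict-alleles-to BIALLELIC \
  -O biallelic_hg38_af-only-gnomad.vcf.gz"
echo "$cmd"
```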

    I get this error after about 90 minutes of execution (and no output file produced):
    10:01:20.727 INFO ProgressMeter - Current Locus Elapsed Minutes Loci Processed Loci/Minute
    12:33:12.626 INFO GetPileupSummaries - Shutting down engine
    [May 17, 2019 12:33:12 PM CEST] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 163.28 minutes.
    Runtime.totalMemory()=66862448640
    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.BitSet.initWords(BitSet.java:166)
    at java.util.BitSet.<init>(BitSet.java:161)
    at htsjdk.samtools.GenomicIndexUtil.regionToBins(GenomicIndexUtil.java:115)
    at htsjdk.samtools.BinningIndexContent.getChunksOverlapping(BinningIndexContent.java:121)
    at htsjdk.samtools.CachingBAMFileIndex.getSpanOverlapping(CachingBAMFileIndex.java:75)
    at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
    at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
    at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:405)
    at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextIterator(SamReaderQueryingIterator.java:125)
    at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.<init>(SamReaderQueryingIterator.java:66)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.prepareIteratorsForTraversal(ReadsDataSource.java:404)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.iterator(ReadsDataSource.java:330)
    at java.lang.Iterable.spliterator(Iterable.java:101)
    at org.broadinstitute.hellbender.utils.Utils.stream(Utils.java:1098)
    at org.broadinstitute.hellbender.engine.GATKTool.getTransformedReadStream(GATKTool.java:376)
    at org.broadinstitute.hellbender.engine.LocusWalker.traverse(LocusWalker.java:164)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1039)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)

  • AdelaideR (Member, admin)

    Hi @atari -

    I would try increasing the RAM you are using; these are very large files. It would also be helpful to see the exact command you are running. For example, did you set the parameter to output all sites instead of just variant sites?

  • atari, Switzerland (Member)

    Hi Adelaide,
    This is the command that I am running:
    gatk --java-options "-Xmx80g" GetPileupSummaries \
    -I bqsr_markedDuplicates_18852.bam \
    -V biallelic_hg38_af-only-gnomad.vcf.gz \
    -L biallelic_hg38_af-only-gnomad.vcf.gz \
    -O 7_tumor_getpileupsummaries.table

    I tried increasing the RAM to 80G. After 24 hours the process is still running (it may be hung...).

  • AdelaideR (Member, admin)

    Hi @atari

    I am going to do a little more research to see if I can figure out another strategy.

    Is there a limit to your RAM? Any chance you could bump it up to 240G?

  • atariatari SwitzerlandMember

    Thx.
    Our cluster at the European Institute of Oncology in Milan currently has a maximum of 128 GB of RAM per node.

  • AdelaideR (Member, admin)

    Hi @atari -

    It seems that you could possibly increase it to 120G, which is 50% more RAM than the current command uses. Try that and let us know if the program still hangs.

    Also, an issue has been created to notify our product development team about this limitation. Please feel free to track the issue here.

    The other option is to use a cloud computing platform. Do you have any access to the cloud through your institute? You can create machines with more RAM just for this part of the analysis.

  • davidben, Boston (Member, Broadie, Dev ✭✭✭)

    @atari @registered_user In our best practices we recommend using gs://gatk-best-practices/somatic-hg38/small_exac_common_3.hg38.vcf.gz (there is an hg19 version in the same bucket) for the contamination pipeline, even when running on a WGS sample. Otherwise you end up loading a huge chunk of gnomAD into RAM. There are plenty of variants in the exome alone to obtain a good contamination estimate.
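A sketch of what that swap looks like is below. The local file names are hypothetical, the gsutil step assumes the Google Cloud SDK is installed, and the command is printed rather than executed here.

```shell
# Hypothetical local paths; the command string is printed, not run.
# Fetch the small common-variant resource and use it as both the
# variants file (-V) and the intervals (-L) instead of full gnomAD.
vcf=small_exac_common_3.hg38.vcf.gz
cmd="gsutil cp gs://gatk-best-practices/somatic-hg38/${vcf} . && \
gatk GetPileupSummaries \
  -I sample.bam \
  -V ${vcf} -L ${vcf} \
  -O sample_pileups.table"
echo "$cmd"
```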
