GATK MarkDuplicates java.lang.OutOfMemoryError despite high memory provided

FPBarthel Houston Member ✭✭

I am getting java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space with MarkDuplicates on several samples.

INFO    2018-08-14 12:16:44     MarkDuplicates  Start of doWork freeMemory: 1647961320; totalMemory: 1686634496; maxMemory: 11453595648
INFO    2018-08-14 12:16:44     MarkDuplicates  Reading input file and constructing read end information.
INFO    2018-08-14 12:16:44     MarkDuplicates  Will retain up to 41498534 data points before spilling to disk.
INFO    2018-08-14 12:16:57     MarkDuplicates  Read     1,000,000 records.  Elapsed time: 00:00:12s.  Time for last 1,000,000:   12s.  Last read position: 1:3,531,086
INFO    2018-08-14 12:16:57     MarkDuplicates  Tracking 14684 as yet unmatched pairs. 1223 records in RAM.
...
...
...
INFO    2018-08-14 12:54:10     MarkDuplicates  Tracking 35191054 as yet unmatched pairs. 34061065 records in RAM.
[Tue Aug 14 13:00:43 EDT 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 44.01 minutes.
Runtime.totalMemory()=12462850048
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
...
...
...

I have tried increasing memory to 24 GB and --MAX_RECORDS_IN_RAM to 250k * 24 = 60000000, but the output is identical and the error occurs at the same point. The --TMP_DIR parameter is set to a large scratch space with several TB of free space.
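For reference, this is roughly how I have been invoking the tool; the file names and scratch path below are placeholders for the actual paths:

gatk --java-options "-Xmx24g" MarkDuplicates \
    --INPUT sample.realn.bam \
    --OUTPUT sample.mdup.bam \
    --METRICS_FILE sample.metrics.txt \
    --MAX_RECORDS_IN_RAM 60000000 \
    --TMP_DIR /path/to/scratch/tmp \
    --CREATE_INDEX true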

I noticed that there are a lot of "unmatched pairs", so I ran ValidateSamFile on the input, but it does not report any errors. The input is approximately 90 GB of whole-genome data that was aligned using bwa-mem with the -M parameter.
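The validation step was along these lines (again, the file name is a placeholder):

gatk ValidateSamFile \
    --INPUT sample.realn.bam \
    --MODE SUMMARY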

Any thoughts on what else I can try?

Answers

  • FPBarthel Houston Member ✭✭

    So actually giving 70 GB of RAM to the JVM and setting --MAX_RECORDS_IN_RAM to ~1.4 billion solved this. Nevertheless, it seems that --MAX_RECORDS_IN_RAM does not apply to the "as yet unmatched pairs". In the example above it was set to 500000 (the default), and just before it erred out there were 34 million records in RAM, way above the limit. This seems to have been the cause of the problem. Perhaps this can be fixed?

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @FPBarthel
    Hi,

    Thanks for all the information. I need to ask the team and get back to you.

    -Sheila

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @FPBarthel
    Hi,

    The developer has said "You need to lower Max records in RAM not increase it."

    -Sheila

  • FPBarthel Houston Member ✭✭
    edited August 2018

    @Sheila the issue seems to be that records marked as "as yet unmatched pairs" are kept in memory regardless of what you tell GATK, e.g. even if you set --MAX_RECORDS_IN_RAM to 1 these are still held in memory rather than being spilled to scratch space as they should be.

    In the example above, it [MAX_RECORDS_IN_RAM] was set to 500000 [500k] (the default) and just before it erred out, there were 34 million records in RAM, which was way above the limit [of 500k]. This seemed to have been the cause of the problem.

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @FPBarthel
    Hi,

    Interesting. I need to check with the developer again on this behavior.

    -Sheila

  • yfarjoun Broad Institute Dev ✭✭✭

    Does the file pass picard validation? (ValidateSamFile)

  • FPBarthel Houston Member ✭✭

    It does, though the sample quality is very bad, with > 40% PCR duplicates. As I mentioned in the follow-up post, I was able to run it using -Xmx70g and --MAX_RECORDS_IN_RAM left at the default. In comparison, -Xmx24g and -Xmx48g both failed with the given error. Records marked as "as yet unmatched pairs" do not get written to disk (as they should?) after exceeding --MAX_RECORDS_IN_RAM.
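    Concretely, the run that finally succeeded looked roughly like this (file names are placeholders; everything else as in the failing runs):

        gatk --java-options "-Xmx70g" MarkDuplicates \
            --INPUT sample.1.realn.bam \
            --INPUT sample.2.realn.bam \
            --OUTPUT sample.realn.mdup.bam \
            --METRICS_FILE sample.metrics.txt \
            --CREATE_INDEX true \
            --TMP_DIR /path/to/scratch/tmp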

  • yfarjoun Broad Institute Dev ✭✭✭

    Yes, this is a map of reads for which the program has yet to find their mates. It has nothing to do with MAX_RECORDS_IN_RAM. Here, all the "orphan" reads for a given sequence are kept in memory at the same time... not ideal, but that's what the current implementation does. I'm wondering why you have so many "orphans".

    You truncated an important line of the log... could you post more lines from the end? In particular, I'm wondering at what position it finally dies.

    Another few questions:
    1. what is the depth of your file?
    2. how long are the reads?
    3. what does the insert-size look like?

  • FPBarthel Houston Member ✭✭

    Unfortunately I lost the log file. I'm regenerating the BAM file so I can re-run MarkDuplicates and reproduce this. It was a while back, so I had to delete the BAM files to make room (and the log file got overwritten when I changed the memory setting to fix this).

    I do remember that:

    INFO    2018-08-14 12:54:10     MarkDuplicates  Tracking 35191054 as yet unmatched pairs. 34061065 records in RAM.
    

    Was in fact the last entry before execution failed.

    To answer your other questions in the meantime: the depth is approximately 30x, the read length is 2x150 bp, and the insert size is around 400 bp. I can share distributions after I regenerate the BAM.
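    Once the BAM is regenerated, I plan to pull the insert-size distribution with something along these lines (output names are placeholders; argument names as in recent GATK/Picard releases, so adjust to your version):

        gatk CollectInsertSizeMetrics \
            --INPUT sample.realn.bam \
            --OUTPUT sample.insert_size_metrics.txt \
            --Histogram_FILE sample.insert_size_histogram.pdf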

  • FPBarthel Houston Member ✭✭
    edited September 2018

    @yfarjoun here are some lines from the end of a log for a BAM file that is not exactly the same one as shown above, but that hits a similar/same error with a very high number of "orphaned" reads. The number of orphaned reads held in memory increases steadily until ~12 GB of memory is reached, and then the run fails with a memory error. That's why I figured that these "as yet unmatched" (or orphan) reads may be ignoring MAX_RECORDS_IN_RAM and are never written to scratch space, even when they exceed the threshold. I have to say the data quality for these samples is exceptionally bad, with a very high number of PCR duplicates; that may have something to do with it?

    ...
    INFO    2018-09-13 10:55:32     MarkDuplicates  Tracking 21767822 as yet unmatched pairs. 20526543 records in RAM.
    INFO    2018-09-13 10:55:52     MarkDuplicates  Read   154,000,000 records.  Elapsed time: 00:18:45s.  Time for last 1,000,000:   19s.  Last read position: 2:33,141,320
    INFO    2018-09-13 10:55:52     MarkDuplicates  Tracking 22600037 as yet unmatched pairs. 21332180 records in RAM.
    INFO    2018-09-13 10:56:15     MarkDuplicates  Read   155,000,000 records.  Elapsed time: 00:19:09s.  Time for last 1,000,000:   23s.  Last read position: 2:33,141,320
    INFO    2018-09-13 10:56:15     MarkDuplicates  Tracking 23429574 as yet unmatched pairs. 22135887 records in RAM.
    INFO    2018-09-13 10:56:39     MarkDuplicates  Read   156,000,000 records.  Elapsed time: 00:19:33s.  Time for last 1,000,000:   24s.  Last read position: 2:33,141,320
    INFO    2018-09-13 10:56:39     MarkDuplicates  Tracking 24281932 as yet unmatched pairs. 22962678 records in RAM.
    INFO    2018-09-13 10:57:04     MarkDuplicates  Read   157,000,000 records.  Elapsed time: 00:19:57s.  Time for last 1,000,000:   24s.  Last read position: 2:33,141,320
    INFO    2018-09-13 10:57:04     MarkDuplicates  Tracking 25122639 as yet unmatched pairs. 23776819 records in RAM.
    INFO    2018-09-13 10:57:53     MarkDuplicates  Read   158,000,000 records.  Elapsed time: 00:20:46s.  Time for last 1,000,000:   48s.  Last read position: 2:33,141,320
    INFO    2018-09-13 10:57:53     MarkDuplicates  Tracking 25955181 as yet unmatched pairs. 24580814 records in RAM.
    INFO    2018-09-13 10:58:39     MarkDuplicates  Read   159,000,000 records.  Elapsed time: 00:21:33s.  Time for last 1,000,000:   46s.  Last read position: 2:33,141,320
    INFO    2018-09-13 10:58:39     MarkDuplicates  Tracking 26803626 as yet unmatched pairs. 25400146 records in RAM.
    INFO    2018-09-13 10:59:51     MarkDuplicates  Read   160,000,000 records.  Elapsed time: 00:22:44s.  Time for last 1,000,000:   71s.  Last read position: 2:33,141,320
    INFO    2018-09-13 10:59:51     MarkDuplicates  Tracking 27665191 as yet unmatched pairs. 26237937 records in RAM.
    INFO    2018-09-13 11:01:47     MarkDuplicates  Read   161,000,000 records.  Elapsed time: 00:24:41s.  Time for last 1,000,000:  116s.  Last read position: 2:33,141,320
    INFO    2018-09-13 11:01:47     MarkDuplicates  Tracking 28536798 as yet unmatched pairs. 27093649 records in RAM.
    INFO    2018-09-13 11:04:21     MarkDuplicates  Read   162,000,000 records.  Elapsed time: 00:27:14s.  Time for last 1,000,000:  153s.  Last read position: 2:33,141,320
    INFO    2018-09-13 11:04:21     MarkDuplicates  Tracking 29408440 as yet unmatched pairs. 27950149 records in RAM.
    
    INFO    2018-09-13 11:13:15     MarkDuplicates  Read   163,000,000 records.  Elapsed time: 00:36:08s.  Time for last 1,000,000:  534s.  Last read position: 2:33,141,320
    INFO    2018-09-13 11:13:15     MarkDuplicates  Tracking 30293460 as yet unmatched pairs. 28816782 records in RAM.
    [Thu Sep 13 11:31:19 EDT 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 54.24 minutes.
    Runtime.totalMemory()=12402032640
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
            at htsjdk.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
            at htsjdk.samtools.BAMRecord.decodeAttributes(BAMRecord.java:415)
            at htsjdk.samtools.BAMRecord.getAttribute(BAMRecord.java:394)
            at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2068)
            at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:811)
            at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
            at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
            at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
            at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
            at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
            at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
            at htsjdk.samtools.MergingSamRecordIterator.next(MergingSamRecordIterator.java:130)
            at htsjdk.samtools.MergingSamRecordIterator.next(MergingSamRecordIterator.java:38)
            at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:495)
            at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:232)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
            at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
            at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
            at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
            at org.broadinstitute.hellbender.Main.main(Main.java:289)
    Using GATK jar /projects/barthf/opt/gatk-4.0.8.0/gatk-package-4.0.8.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx12g -jar /projects/barthf/opt/gatk-4.0.8.0/gatk-package-4.0.8.0-local.jar MarkDuplicates --INPUT=xx.H5TGK.1.realn.bam --INPUT=xx.H5TGK.2.realn.bam --OUTPUT=xx.realn.mdup.bam --METRICS_FILE=xx.metrics.txt --CREATE_INDEX=true --TMP_DIR=/fastscratch/xx/tmp
    
  • yfarjoun Broad Institute Dev ✭✭✭

    That is a very problematic region in the genome....but even then, while I have files with depth > 300k in that region, MarkDuplicates manages to get through them somehow.

    I agree that it's a problem...we need to think about how to solve it....

  • FPBarthel Houston Member ✭✭

    To confirm what you said, I've checked about ~10 logs from different samples (all with high %-dup) that failed with this error, and they all fail at around 2:33,141,320.
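    (For what it's worth, this is the kind of quick check I used to pull the last reported position out of each failed log; the log path pattern is a placeholder.)

        # print the last "Last read position" reported in each failed MarkDuplicates log
        for f in logs/*.mdup.log; do
            grep "Last read position" "$f" | tail -n 1
        done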

  • yfarjoun Broad Institute Dev ✭✭✭

    could you try using MarkDuplicatesSpark from GATK?
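    Something along these lines, as a minimal sketch (file names and the scratch path are placeholders):

        gatk MarkDuplicatesSpark \
            -I sample.realn.bam \
            -O sample.realn.mdup.bam \
            -M sample.metrics.txt \
            --tmp-dir /path/to/scratch/tmp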

  • yfarjoun Broad Institute Dev ✭✭✭

    Alternatively, consider sorting by queryname first and then running MarkDuplicates. (thanks @emeryj for the offline suggestion!)
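    That is, an explicit queryname sort followed by MarkDuplicates, roughly like the sketch below (file names are placeholders; note that MarkDuplicates output will then also be queryname-sorted, so it would need a coordinate sort before indexing):

        gatk SortSam \
            --INPUT sample.realn.bam \
            --OUTPUT sample.realn.qsort.bam \
            --SORT_ORDER queryname \
            --TMP_DIR /path/to/scratch/tmp

        gatk MarkDuplicates \
            --INPUT sample.realn.qsort.bam \
            --OUTPUT sample.qsort.mdup.bam \
            --METRICS_FILE sample.metrics.txt \
            --TMP_DIR /path/to/scratch/tmp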

  • FPBarthel Houston Member ✭✭

    Apologies for the late reply here! Unfortunately I don't have access to a Spark cluster. Also, I managed to get it to run correctly by requesting a Java VM with 70 GB of memory (essentially resolving the issue for me), but I raised this topic to point out what I think (but may be wrong) is a bug/memory leak. I don't think this much resource use can be considered within expectations?

    I don't know whether sorting by queryname again would work, but I also use MarkDuplicates as a "gather" step to combine multiple readgroups that were processed separately using bwa and MergeBamAlignment. The individual readgroups are sorted with FastqToSam prior to adapter marking and alignment.
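    For context, the per-readgroup flow is roughly the following (heavily simplified sketch with adapter marking omitted; file names, reference path and thread count are placeholders):

        # per readgroup: FASTQ -> unmapped BAM (queryname-sorted by default)
        gatk FastqToSam \
            --FASTQ rg1_R1.fastq.gz \
            --FASTQ2 rg1_R2.fastq.gz \
            --OUTPUT rg1.unmapped.bam \
            --SAMPLE_NAME sampleA \
            --READ_GROUP_NAME rg1

        # align with bwa-mem (-M as mentioned above), then merge the alignments
        # back with the unmapped BAM to retain the original read information
        bwa mem -M -t 8 ref.fasta rg1_R1.fastq.gz rg1_R2.fastq.gz > rg1.aligned.sam

        gatk MergeBamAlignment \
            --UNMAPPED_BAM rg1.unmapped.bam \
            --ALIGNED_BAM rg1.aligned.sam \
            --REFERENCE_SEQUENCE ref.fasta \
            --OUTPUT rg1.realn.bam

        # MarkDuplicates then "gathers" all readgroups of a sample in one step
        gatk MarkDuplicates \
            --INPUT rg1.realn.bam \
            --INPUT rg2.realn.bam \
            --OUTPUT sampleA.realn.mdup.bam \
            --METRICS_FILE sampleA.metrics.txt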
