GATK MarkDuplicates java.lang.OutOfMemoryError despite high memory provided
I am getting either
java.lang.OutOfMemoryError: GC overhead limit exceeded
or
java.lang.OutOfMemoryError: Java heap space
with MarkDuplicates on several samples.
INFO 2018-08-14 12:16:44 MarkDuplicates Start of doWork freeMemory: 1647961320; totalMemory: 1686634496; maxMemory: 11453595648
INFO 2018-08-14 12:16:44 MarkDuplicates Reading input file and constructing read end information.
INFO 2018-08-14 12:16:44 MarkDuplicates Will retain up to 41498534 data points before spilling to disk.
INFO 2018-08-14 12:16:57 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:12s. Time for last 1,000,000: 12s. Last read position: 1:3,531,086
INFO 2018-08-14 12:16:57 MarkDuplicates Tracking 14684 as yet unmatched pairs. 1223 records in RAM.
... ... ...
INFO 2018-08-14 12:54:10 MarkDuplicates Tracking 35191054 as yet unmatched pairs. 34061065 records in RAM.
[Tue Aug 14 13:00:43 EDT 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 44.01 minutes.
Runtime.totalMemory()=12462850048
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
... ... ...
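To make the numbers in this log easier to read, here is a minimal Python sketch that pulls out the JVM heap ceiling and the growth in unmatched pairs. It is not part of Picard or GATK. The function names are made up for illustration, and the `LOG` string is just three lines copied from the excerpt above.

```python
import re

# Three lines copied from the log excerpt above (assumption: the real log
# has one entry per line in this same format).
LOG = """\
INFO 2018-08-14 12:16:44 MarkDuplicates Start of doWork freeMemory: 1647961320; totalMemory: 1686634496; maxMemory: 11453595648
INFO 2018-08-14 12:16:57 MarkDuplicates Tracking 14684 as yet unmatched pairs. 1223 records in RAM.
INFO 2018-08-14 12:54:10 MarkDuplicates Tracking 35191054 as yet unmatched pairs. 34061065 records in RAM.
"""

GIB = 1024 ** 3  # bytes per GiB


def max_heap_gib(log):
    """Return the JVM maxMemory value from the log in GiB, or None if absent."""
    match = re.search(r"maxMemory: (\d+)", log)
    return int(match.group(1)) / GIB if match else None


def tracking_progress(log):
    """Return (unmatched_pairs, records_in_ram) for every 'Tracking' line."""
    pattern = re.compile(
        r"Tracking (\d+) as yet unmatched pairs\. (\d+) records in RAM"
    )
    return [(int(pairs), int(in_ram)) for pairs, in_ram in pattern.findall(log)]


heap = max_heap_gib(LOG)
progress = tracking_progress(LOG)

print(f"max heap: {heap:.2f} GiB")
for unmatched, in_ram in progress:
    print(f"unmatched pairs: {unmatched:>12,}  records in RAM: {in_ram:>12,}")
```

The maxMemory value in this excerpt works out to roughly 10.67 GiB, and the unmatched-pair count grows from about 15 thousand to about 35 million between the first and last "Tracking" lines shown.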
I have tried increasing memory to 24 GB and --MAX_RECORDS_IN_RAM to 250k * 24 = 60000000, but the output is identical and the error occurs at the same point. The --TMP_DIR parameter is set to a large scratch space with several TB of free space.
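As a sanity check on the arithmetic for --MAX_RECORDS_IN_RAM, here is a quick sketch. It assumes that "250k" means 250,000 records per GB of heap and that the heap is the 24 GB mentioned above; the variable names are only illustrative.

```python
# Assumption: "250k" in the post means 250,000 records per GB of JVM heap.
RECORDS_PER_GB = 250_000
heap_gb = 24  # heap size mentioned in the post

max_records_in_ram = RECORDS_PER_GB * heap_gb
print(f"{max_records_in_ram:,}")
```

The product is 6,000,000, which is one zero short of the 60000000 written above, so it may be worth double-checking which value was actually passed on the command line.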
I noticed that there are a lot of "unmatched pairs", so I ran ValidateSamFile on the input, but it did not report any errors. The input is approximately 90 GB of whole genome data that was aligned using bwa-mem with the
Any thoughts on what else I can try?