picard-tools MarkDuplicates - spilling to disk
Hope someone can shed some light on this issue.
I'm having a problem running picard-tools MarkDuplicates: it fails with a "No space left on device" error. Searching around, I found people mentioning that this can be an issue with the tmpdir folder specified; however, the folder I'm using for tmpdir is massive (72GB; see the quick df check after the log below).
Looking a bit more at the error log, I noticed the "Will retain up to ... data points before spilling to disk" line. The number it reports matches very closely the number of records read before the error message (28872640 vs 29,000,000):
INFO 2015-09-03 15:53:32 MarkDuplicates Will retain up to 28872640 data points before spilling to disk.
INFO 2015-09-03 15:55:50 MarkDuplicates Read 29,000,000 records. Elapsed time: 00:02:18s. Time for last 1,000,000: 4s. Last read position: chr7:39,503,936
INFO 2015-09-03 15:55:50 MarkDuplicates Tracking 195949 as yet unmatched pairs. 13309 records in RAM.
[Thu Sep 03 15:55:53 EST 2015] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 2.35 minutes.
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
Caused by: java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
... 6 more
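For reference, the kind of check I'm using to confirm the temp area has room is something like the following (the working_temp path is the one from my command further down; /tmp is included only because I'm not completely certain where the spill files actually end up):

    # free space on the filesystem holding the temp directory I'm pointing Picard at
    df -h /short/a32/working_temp
    # free space on the default java.io.tmpdir location, in case the spill files land there instead
    df -h /tmp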
I had a play around with the memory option of java (-Xmx??g) when issuing my MarkDuplicates call, and I can see that increasing the memory increases the number of data points retained before spilling to disk. This in turn increases the number of records read before the "No space left on device" error.
e.g. -Xmx16g gave me 59674689 data points before spilling to disk, and I got up to 60,000,000 records read before the "No space left on device" error.
I know I can keep increasing the memory to allow for more records, but there is a limit to that approach if I have a huge BAM.
What I would like to know is what "Will retain up to 28872640 data points before spilling to disk." actually means. I thought it was a safeguard for memory usage: once the number of records/data points held in RAM exceeds that limit, some are written out to temporary files, allowing more records to be read. That would mean you can still process a large BAM with only a small amount of memory, but it does not seem to work that way from what I'm seeing.
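My rough mental model of where that number comes from is sketched below. This is a minimal sketch only, assuming the threshold is derived from the JVM heap in the way an htsjdk SortingCollection-style buffer would do it; the 0.25 ratio and the per-record size are my guesses, not values taken from the Picard source.

    // Minimal sketch (not the actual Picard source) of how a heap-proportional
    // spill threshold like "Will retain up to N data points" could be derived.
    // The 0.25 ratio and the 100-byte record size are assumptions for illustration.
    public class SpillThresholdSketch {
        public static void main(String[] args) {
            long maxHeapBytes = Runtime.getRuntime().maxMemory(); // grows with -Xmx
            double sortingCollectionSizeRatio = 0.25;             // assumed fraction of heap used for the buffer
            int bytesPerDataPoint = 100;                          // assumed in-memory size of one record end

            long maxInMemory = (long) (maxHeapBytes * sortingCollectionSizeRatio) / bytesPerDataPoint;
            System.out.println("Will retain up to " + maxInMemory + " data points before spilling to disk.");

            // My understanding of the spill step: once more than maxInMemory entries
            // accumulate, the in-memory batch is written to temporary files under
            // java.io.tmpdir (or wherever TMP_DIR points) and reading continues, so heap
            // stays bounded while temp-disk usage grows with the size of the input BAM.
        }
    }

If that model is right, the threshold scaling with -Xmx makes sense, but the spill step should still let a large BAM go through with modest memory, which is exactly why the out-of-space error confuses me.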
My command:

    java -Xmx16g -Djavaio.tmpdir=/short/a32/working_temp -jar $PICARD_TOOLS_DIR/picard.jar MarkDuplicates INPUT=output.bam OUTPUT=output.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT
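I haven't tried it yet, but I assume the temp directory could also be handed to Picard directly through its TMP_DIR option, instead of (or as well as) the Java system property, along these lines:

    java -Xmx16g -jar $PICARD_TOOLS_DIR/picard.jar MarkDuplicates \
        INPUT=output.bam \
        OUTPUT=output.marked.bam \
        METRICS_FILE=metrics \
        TMP_DIR=/short/a32/working_temp \
        CREATE_INDEX=true \
        VALIDATION_STRINGENCY=LENIENT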
Hope you can help, and thanks in advance for your assistance.