picard-tools MarkDuplicates - spilling to disk

ipeddie (Sydney, Australia) · Member · Posts: 4

Hi there,

Hope someone can shed some light on this issue.

I have a problem running picard-tools MarkDuplicates: I get the error "No space left on device". After a bit of searching, I found people mentioning that it might be an issue with the tmpdir folder specified. However, the folder I'm using for tmpdir is massive (72GB). Looking more closely at the error log, I found the line about retaining data points before spilling to disk.

Its number closely matches the number of records read before the error message (28872640 vs 29,000,000).

INFO 2015-09-03 15:53:32 MarkDuplicates Will retain up to 28872640 data points before spilling to disk.
...
INFO 2015-09-03 15:55:50 MarkDuplicates Read 29,000,000 records. Elapsed time: 00:02:18s. Time for last 1,000,000: 4s. Last read position: chr7:39,503,936
INFO 2015-09-03 15:55:50 MarkDuplicates Tracking 195949 as yet unmatched pairs. 13309 records in RAM.
[Thu Sep 03 15:55:53 EST 2015] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 2.35 minutes.
Runtime.totalMemory()=6107234304
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:245)
at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:165)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:281)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:114)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:318)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:127)
at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100)
at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137)
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:236)
... 6 more

I played around with the Java memory option (-Xmx??g) in my MarkDuplicates call, and I see that increasing the memory increases the number of data points retained before spilling to disk. This in turn increases the number of records read before my "No space left on device" error.

E.g. -Xmx16g gave me 59674689 data points before spilling to disk, and I got up to 60,000,000 records read before the "No space left on device" error.

I know I can increase my memory to allow for more records, but there is a limit to doing that if I have a huge BAM.

What I would like to know is what "Will retain up to 28872640 data points before spilling to disk." actually means. I thought it was a safeguard for memory usage: if the number of records/data points is exceeded, some are written to file, allowing more records to be read. That would mean you can still process a large BAM with only a small amount of memory. But it does not seem to work that way from what I'm seeing.

My command: "java -Xmx16g -Djavaio.tmpdir=/short/a32/working_temp -jar $PICARD_TOOLS_DIR/picard.jar MarkDuplicates INPUT=output.bam OUTPUT=output.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT"

Hope you can help and thanks for your assistance in advance.
Eddie

Issue · Github: issue number 160, filed by Sheila. State: closed. Closed by: chandrans.

Comments

  • Sheila (Broad Institute) · Member, Broadie, Moderator · Posts: 4,693 · admin

    @ipeddie
    Hi Eddie,

    @thibault helped me out here. It turns out you need to set the Picard option for TMP_DIR. Have a look at this page: https://broadinstitute.github.io/picard/command-line-overview.html under Standard Options. Right now, you are setting the Java option for temp dirs, which MarkDuplicates does not use.

    -Sheila
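The "No space left on device" message refers to whichever filesystem the temporary files actually land on, so before rerunning it can help to compare free space on the candidate directories. Below is a minimal sketch in Python; the helper names `free_gib` and `report_temp_space` are hypothetical, and `tempfile.gettempdir()` is only Python's notion of the default temp directory, used here as a stand-in for wherever a tool writes when no temp directory is configured.

```python
import shutil
import tempfile


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / 2**30


def report_temp_space(candidates):
    """Map each candidate directory to its free space in GiB (one decimal)."""
    return {path: round(free_gib(path), 1) for path in candidates}


if __name__ == "__main__":
    # Add the directory you intend to pass as TMP_DIR to this list,
    # e.g. "/short/a32/working_temp" from the post above.
    print(report_temp_space([tempfile.gettempdir()]))
```

If the default location reports far less free space than the intended working directory, that would be consistent with the temp files not going where expected.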

  • ipeddie (Sydney, Australia) · Member · Posts: 4

    Hi Sheila,

    Thanks for the reply.

    So my command line should read:

    "java -jar $PICARD_TOOLS_DIR/picard.jar MarkDuplicates INPUT=output.bam OUTPUT=output.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/short/a32/working_temp"

    By the way, can you explain what "before spilling to disk" actually means in the INFO statement? I'm still a bit confused by that.

    Thanks..
    Eddie
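For scripted runs, the same command can be assembled as an argument list so that the temp directory is always passed as the Picard `TMP_DIR` option. This is a minimal sketch in Python: the function name `markduplicates_command` and its parameters are hypothetical, the Picard options are only those already quoted in this thread, and the `-Xmx16g` default is carried over from the original post as an assumption about what is wanted.

```python
import shlex


def markduplicates_command(picard_jar, input_bam, output_bam, metrics_file,
                           tmp_dir, max_heap="16g"):
    """Build the argument list for a Picard MarkDuplicates run."""
    return [
        "java", f"-Xmx{max_heap}",
        "-jar", picard_jar,
        "MarkDuplicates",
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"METRICS_FILE={metrics_file}",
        "CREATE_INDEX=true",
        "VALIDATION_STRINGENCY=LENIENT",
        f"TMP_DIR={tmp_dir}",  # Picard option, placed after the tool name
    ]


if __name__ == "__main__":
    args = markduplicates_command(
        "picard.jar", "output.bam", "output.marked.bam", "metrics",
        "/short/a32/working_temp",
    )
    # Print a shell-quoted version for inspection before running it.
    print(shlex.join(args))
```

The list can be handed to `subprocess.run(args, check=True)` once the printed command looks right.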

  • Sheila (Broad Institute) · Member, Broadie, Moderator · Posts: 4,693 · admin

    @ipeddie
    Hi Eddie,

    I had to ask Joel again for some help with this one.

    It turns out there is a data structure that takes up a lot of RAM, so it stores some of that on disk temporarily. That's why it needs a temp directory. The only thing you need to worry about is supplying a temp directory with enough space, which is what you have done correctly with your above command.

    -Sheila
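To make the "spilling to disk" idea concrete, here is a toy sketch in Python of a sorted collection that holds a bounded number of records in RAM and writes sorted batches to temporary files once that bound is reached. It is not the htsjdk `SortingCollection` code named in the stack trace; the class `SpillingSorter` and its methods are hypothetical and only illustrate the general technique (an external merge sort).

```python
import heapq
import os
import pickle
import tempfile


class SpillingSorter:
    """Toy sorted collection that keeps at most max_in_ram records in memory."""

    def __init__(self, max_in_ram, tmp_dir=None):
        self.max_in_ram = max_in_ram
        self.tmp_dir = tmp_dir    # None means the platform default temp dir
        self.buffer = []          # records currently held in RAM
        self.spill_files = []     # paths of sorted runs already on disk

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_in_ram:
            self._spill()

    def _spill(self):
        # Sort the in-RAM batch and write it out as one run. A full temp
        # filesystem would make this write fail.
        self.buffer.sort()
        fd, path = tempfile.mkstemp(dir=self.tmp_dir, suffix=".run")
        with os.fdopen(fd, "wb") as handle:
            for record in self.buffer:
                pickle.dump(record, handle)
        self.spill_files.append(path)
        self.buffer = []

    @staticmethod
    def _read_run(path):
        # Stream records back from one on-disk run, one at a time.
        with open(path, "rb") as handle:
            while True:
                try:
                    yield pickle.load(handle)
                except EOFError:
                    return

    def sorted_records(self):
        # Merge every on-disk run with whatever is still in RAM.
        runs = [self._read_run(path) for path in self.spill_files]
        runs.append(iter(sorted(self.buffer)))
        yield from heapq.merge(*runs)

    def cleanup(self):
        for path in self.spill_files:
            os.remove(path)
        self.spill_files = []
```

In a design like this, a larger in-RAM limit only postpones the first write to disk. That would be consistent with the error in this thread appearing close to the "Will retain up to" figure and moving later when `-Xmx` was raised, while the real fix was pointing the temp files at a directory with enough space.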
