picard-tools MarkDuplicates - spilling to disk

ipeddie, Sydney, Australia (Member)

Hi there,

Hope someone can shed some light on this issue.

I have a problem running picard-tools MarkDuplicates: I get the error "No space left on device". After some searching, I found people mentioning that it might be an issue with the tmpdir folder specified. However, the folder I'm using for tmpdir is massive (72GB). Looking more closely at the error log, I found the "Will retain up to ... data points before spilling to disk" line.

The number on that line closely matches the number of records read before the error message (28872640 vs 29,000,000).

INFO 2015-09-03 15:53:32 MarkDuplicates Will retain up to 28872640 data points before spilling to disk.
...
INFO 2015-09-03 15:55:50 MarkDuplicates Read 29,000,000 records. Elapsed time: 00:02:18s. Time for last 1,000,000: 4s. Last read position: chr7:39,503,936
INFO 2015-09-03 15:55:50 MarkDuplicates Tracking 195949 as yet unmatched pairs. 13309 records in RAM.
[Thu Sep 03 15:55:53 EST 2015] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 2.35 minutes.
Runtime.totalMemory()=6107234304
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: No space left on device
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:245)
at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:165)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:281)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:114)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:318)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:127)
at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100)
at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137)
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:236)
... 6 more

I played around with the Java memory option (-Xmx??g) in my MarkDuplicates call, and I see that increasing the memory increases the number of data points retained before spilling to disk. This in turn increases the number of records read before my "No space left on device" error.

E.g. -Xmx16g gave me 59674689 data points before spilling to disk, and I got up to 60,000,000 records read before the "No space left on device" error.

I know I can increase my memory to allow for more records, but there is a limit to doing that if I have a huge BAM.

What I would like to know is what "Will retain up to 28872640 data points before spilling to disk." actually means. I thought it was a safeguard for memory usage: if the number of records/data points is exceeded, some are written to file, allowing more records to be read. That would mean you can still process a large BAM with only a small amount of memory. But it does not seem to work that way from what I'm seeing.

My command: "java -Xmx16g -Djavaio.tmpdir=/short/a32/working_temp -jar $PICARD_TOOLS_DIR/picard.jar MarkDuplicates INPUT=output.bam OUTPUT=output.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT"

Hope you can help and thanks for your assistance in advance.
Eddie

Comments

  • Sheila, Broad Institute (Member, Broadie admin)

    @ipeddie
    Hi Eddie,

    @thibault helped me out here. It turns out you need to set the Picard option for TMP_DIR. Have a look at this page: https://broadinstitute.github.io/picard/command-line-overview.html under Standard Options. Right now, you are setting the Java option for temp dirs, which MarkDuplicates does not use.

    -Sheila

  • ipeddie, Sydney, Australia (Member)

    Hi Sheila,

    Thanks for the reply.

    So my command line should read:

    "java -jar $PICARD_TOOLS_DIR/picard.jar MarkDuplicates INPUT=output.bam OUTPUT=output.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/short/a32/working_temp"

    By the way, can you explain what "before spilling to disk" actually means in the INFO statement? I'm still a bit confused by that.

    Thanks..
    Eddie

  • Sheila, Broad Institute (Member, Broadie admin)

    @ipeddie
    Hi Eddie,

    I had to ask Joel again for some help with this one.

    It turns out there is a data structure that takes up a lot of RAM, so it stores some of that on disk temporarily. That's why it needs a temp directory. The only thing you need to worry about is supplying a temp directory with enough space, which is what you have done correctly with your above command.

    -Sheila
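
To make the "retain up to N data points before spilling to disk" message concrete, here is a minimal sketch of a sorter that holds a bounded number of records in RAM and writes sorted chunks to a temp directory once that bound is reached. It is not the htsjdk SortingCollection code: the class name `SpillingSorter`, the parameter `max_in_ram`, and the one-integer-per-line file format are all hypothetical and chosen only for illustration.

```python
import heapq
import os
import tempfile


class SpillingSorter:
    """Toy sorter that keeps at most `max_in_ram` records in memory.

    Hypothetical illustration only; not the htsjdk SortingCollection code.
    """

    def __init__(self, max_in_ram, tmp_dir):
        self.max_in_ram = max_in_ram
        self.tmp_dir = tmp_dir
        self.in_ram = []        # records currently held in memory
        self.spill_files = []   # paths of sorted chunks written to tmp_dir

    def add(self, record):
        self.in_ram.append(record)
        if len(self.in_ram) >= self.max_in_ram:
            self._spill_to_disk()

    def _spill_to_disk(self):
        # Sort the in-memory chunk and write it to a temp file. This write is
        # the step that fails when the temp directory has no space left.
        self.in_ram.sort()
        fd, path = tempfile.mkstemp(dir=self.tmp_dir, suffix=".spill")
        with os.fdopen(fd, "w") as handle:
            for record in self.in_ram:
                handle.write(f"{record}\n")
        self.spill_files.append(path)
        self.in_ram = []

    def sorted_records(self):
        # Merge the sorted chunks on disk with whatever is still in memory.
        handles = [open(path) for path in self.spill_files]
        try:
            streams = [map(int, handle) for handle in handles]
            streams.append(iter(sorted(self.in_ram)))
            yield from heapq.merge(*streams)
        finally:
            for handle in handles:
                handle.close()
            for path in self.spill_files:
                os.remove(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    sorter = SpillingSorter(max_in_ram=3, tmp_dir=tmp_dir)
    for value in [9, 4, 7, 1, 8, 2, 6, 3]:
        sorter.add(value)
    spilled_chunks = len(sorter.spill_files)
    result = list(sorter.sorted_records())

print(spilled_chunks, result)  # 2 [1, 2, 3, 4, 6, 7, 8, 9]
```

In this sketch, raising `max_in_ram` only delays the first write to the temp directory, which is consistent with Eddie's observation that a larger -Xmx moved the failure later without removing it.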

  • jingmeng, Australia (Member)

    Hi, I got the same "No space left on device" error, so I changed my command to

    java -jar ~/picard.jar MarkDuplicates INPUT=~/input.bam METRICS_FILE=~/duplication_metrics OUTPUT=~/dedup.bam TMP_DIR = ~/Desktop

    And I got a new error:

    ERROR: Invalid argument 'TMP_DIR'.

    The Picard version I use is 2.18.9.

    Can you please help me fix this error? Thank you!

  • ipeddie, Sydney, Australia (Member)

    Hi Jingmeng,

    I noticed that your command has spaces around the equals sign (TMP_DIR^=^~/Desktop, with ^ marking the spaces). When I tried my command with spaces in the same positions, it also gave me an error.

    VALIDATION_STRINGENCY=LENIENT TMP_DIR = /g/data3/WGS/temp/
    ERROR: Invalid argument 'TMP_DIR'.

    But if I remove the spaces, the command runs.

    Could this be the cause of your error?

    Eddie
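
The effect of those spaces can be sketched by tokenizing both command lines the way a POSIX shell would. This uses Python's standard `shlex.split`; the helper `picard_arguments` and the paths are hypothetical, and how Picard itself parses its arguments is not shown here. The sketch only shows what the tool receives from the shell.

```python
import shlex


def picard_arguments(command_line):
    """Split a command line as a POSIX shell would and return the tokens
    that follow the tool name (hypothetical helper for illustration)."""
    tokens = shlex.split(command_line)
    tool_index = tokens.index("MarkDuplicates")
    return tokens[tool_index + 1:]


with_spaces = picard_arguments(
    "java -jar picard.jar MarkDuplicates INPUT=input.bam TMP_DIR = /tmp/work"
)
without_spaces = picard_arguments(
    "java -jar picard.jar MarkDuplicates INPUT=input.bam TMP_DIR=/tmp/work"
)

print(with_spaces)     # ['INPUT=input.bam', 'TMP_DIR', '=', '/tmp/work']
print(without_spaces)  # ['INPUT=input.bam', 'TMP_DIR=/tmp/work']
```

With spaces, `TMP_DIR` arrives as a bare token with no value attached, which fits the "Invalid argument 'TMP_DIR'" error reported above.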

  • jingmeng, Australia (Member)

    Thanks, Eddie, for your reply. I removed the spaces and now it works.

  • jejacobs23, Portland, OR (Member)

    Hello, I get a similar error when running RevertSam. My code is as follows:

    srun /usr/bin/java -jar $PICARD_DIR/picard.jar RevertSam \
    I=$INPUT \
    O=$OUTPUT_DIR/unmapped_BAM.bam \
    SANITIZE=true \
    MAX_DISCARD_FRACTION=0.005 \
    ATTRIBUTE_TO_CLEAR=XS \
    ATTRIBUTE_TO_CLEAR=XA \
    SORT_ORDER=queryname \
    RESTORE_ORIGINAL_QUALITIES=true \
    REMOVE_DUPLICATE_INFORMATION=true \
    REMOVE_ALIGNMENT_INFORMATION=true \
    TMP_DIR=$COMMON_DIR/submit_scripts/osteo_Workflow/working_temp

    And I get the following error:

    Runtime.totalMemory()=2022178816
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.io.IOException: Disk quota exceeded
            at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:246)
            at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:166)
            at picard.sam.RevertSam$RevertSamSorter.add(RevertSam.java:633)
            at picard.sam.RevertSam.doWork(RevertSam.java:254)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
            at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
            at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
    Caused by: java.io.IOException: Disk quota exceeded
            at java.io.FileOutputStream.writeBytes(Native Method)
            at java.io.FileOutputStream.write(FileOutputStream.java:326)
            at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
            at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
            at org.xerial.snappy.SnappyOutputStream.dump(SnappyOutputStream.java:127)
            at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:100)
            at org.xerial.snappy.SnappyOutputStream.close(SnappyOutputStream.java:137)
            at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:237)
            ... 6 more
    srun: error: exanode-3-1: task 0: Exited with exit code 1
    

    Is this error also related to the TMP_DIR issue?

  • shlee, Cambridge (Member, Broadie ✭✭✭✭✭)

    Hi @jejacobs23,

    It appears that you've run out of disk space. Can you check how much storage you have on your disk(s) and confirm that the free space is at least the size of your BAM file?
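
One way to run that check before launching a job is sketched below, using only the Python standard library. The function name `has_room_for` and the file names are hypothetical, and "free space at least the size of the BAM" is the rule of thumb from this thread rather than a guarantee. Note also that `shutil.disk_usage` reports filesystem-level free space and may not reflect a per-user quota, which is what the "Disk quota exceeded" message points to.

```python
import os
import shutil
import tempfile


def has_room_for(bam_path, tmp_dir):
    """Return True if the filesystem holding tmp_dir has at least as many
    free bytes as the BAM file occupies (rule of thumb, not a guarantee)."""
    needed = os.path.getsize(bam_path)
    free = shutil.disk_usage(tmp_dir).free
    return free >= needed


# Demo with a small stand-in file; a real check would pass the actual BAM
# path and the directory given to TMP_DIR.
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_bam = os.path.join(tmp_dir, "example.bam")
    with open(fake_bam, "wb") as handle:
        handle.write(b"\0" * 1024)
    needed_bytes = os.path.getsize(fake_bam)
    enough = has_room_for(fake_bam, tmp_dir)

print(needed_bytes, enough)
```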
