MarkDuplicates - 0 pairs never matched

Hello, I am trying to use MarkDuplicates to combine uBAMs generated from paired fastq files across two lanes (WGS on an Illumina NovaSeq), using the GATK paired-fastq-to-unmapped-bam.wdl. I believe I parsed the correct RG information when generating the uBAMs (ValidateSamFile passed), but I am a little confused by the MarkDuplicates output. I have three, likely related, questions.
1) What is the meaning of "Tracking 0 as yet unmatched pairs" and "0 pairs never matched"?
2) There was also a WARNING - what is meant by "Reading will be unbuffered"?
```
INFO 2019-07-28 10:15:16 MarkDuplicates Read 506,000,000 records. Elapsed time: 07:30:29s. Time for last 1,000,000: 43s. Last read position: chr1:102,335,300
INFO 2019-07-28 10:15:16 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2019-07-28 10:15:47 MarkDuplicates Read 506691598 records. 0 pairs never matched.
INFO 2019-07-28 10:16:24 MarkDuplicates After buildSortedReadEndLists freeMemory: 16451310808; totalMemory: 16540762112; maxMemory: 16540762112
INFO 2019-07-28 10:16:24 MarkDuplicates Will retain up to 516898816 duplicate indices before spilling to disk.
INFO 2019-07-28 10:16:25 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2019-07-28 10:16:25 SortingCollection Creating merging iterator from 5 files
WARNING 2019-07-28 10:16:27 SortingCollection There is not enough memory per file for buffering. Reading will be unbuffered.
```
3) Finally, the workflow ends by transitioning to a terminal state (return code not 0), yet it appears MarkDuplicates is "done".
```
INFO 2019-07-28 12:27:32 MarkDuplicates Before output close freeMemory: 8443873968; totalMemory: 13359382528; maxMemory: 14944829440
INFO 2019-07-28 12:27:33 MarkDuplicates After output close freeMemory: 13288318136; totalMemory: 13332643840; maxMemory: 14944829440
[Sun Jul 28 12:27:33 CDT 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 582.83 minutes.
Runtime.totalMemory()=13332643840
[2019-07-28 12:32:26,21] [info] WorkflowManagerActor WorkflowActor-69a61389-a401-4e87-af57-66a767c7b994 is in a terminal state: WorkflowFailedState
[2019-07-28 12:32:37,63] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
[2019-07-28 12:32:40,80] [info] Workflow polling stopped
[2019-07-28 12:32:40,84] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2019-07-28 12:32:40,84] [info] 0 workflows released by cromid-95ee9e9
[2019-07-28 12:32:40,85] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2019-07-28 12:32:40,88] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2019-07-28 12:32:40,90] [info] JobExecutionTokenDispenser stopped
[2019-07-28 12:32:40,90] [info] Aborting all running workflows.
[2019-07-28 12:32:40,99] [info] WorkflowStoreActor stopped
[2019-07-28 12:32:41,00] [info] WorkflowLogCopyRouter stopped
[2019-07-28 12:32:41,00] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2019-07-28 12:32:41,00] [info] WorkflowManagerActor All workflows finished
[2019-07-28 12:32:41,00] [info] WorkflowManagerActor stopped
[2019-07-28 12:32:42,32] [info] Connection pools shut down
[2019-07-28 12:32:42,32] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] SubWorkflowStoreActor stopped
[2019-07-28 12:32:42,32] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] KvWriteActor Shutting down: 0 queued messages to process
[2019-07-28 12:32:42,32] [info] WriteMetadataActor Shutting down: 0 queued messages to process
[2019-07-28 12:32:42,32] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2019-07-28 12:32:42,32] [info] CallCacheWriteActor stopped
[2019-07-28 12:32:42,32] [info] IoProxy stopped
[2019-07-28 12:32:42,33] [info] JobStoreActor stopped
[2019-07-28 12:32:42,33] [info] ServiceRegistryActor stopped
[2019-07-28 12:32:42,33] [info] DockerHashActor stopped
[2019-07-28 12:32:42,35] [info] Database closed
[2019-07-28 12:32:42,35] [info] Stream materializer shut down
[2019-07-28 12:32:42,35] [info] WDL HTTP import resolver closed
```
------------------------------------------------
I am using the gatk4-data-processing workflow with Cromwell 40. I was unable to post the entirety of the workflow output, and the code blocks do not seem to be working for me - I am not sure why, as I have never posted before!
Answers
Hi @jonahcullen, sorry for the delay. I will get back to you soon with more information.
If anyone else can chime in, please do!
Hi @jonahcullen,
The tool scans along looking for pairs; sometimes it finds one read whose mate isn't nearby, so it builds up a collection of unmated reads that still need to be matched. "Tracking 0 as yet unmatched pairs" and "0 pairs never matched" therefore mean that every read found its mate, which is what you want to see.
I see:
```
WARNING 2019-07-28 10:16:27 SortingCollection There is not enough memory per file for buffering. Reading will be unbuffered.
```
Can you increase your memory in the WDL for this task to see if that makes this go away?
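For example, a minimal sketch of the change to the task's runtime block (the 32 GB figure is only a guess to get past the unbuffered-reading warning; size it to your node):
```wdl
runtime {
  # More headroom so SortingCollection can buffer its merge files
  memory: "32 GB"
}
```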
Hi @Tiffany_at_Broad, thank you for your response. I reran this workflow with 250 GB of RAM and got the same WARNING. I am also including the error from the top of the log file `hs_err_pid9437.log`, as perhaps this could give more clues to the issue.
Interestingly, I was able to successfully run the entire workflow using only a random subset of reads from the original fastq files. It seems the issue is Java or memory related, but I cannot imagine how it would require more than 250 GB?
Sorry, I forgot to include the WARNING portion:

Are you able to attach the log file and WDL script you are using (or point me to the right version of the WDL from wherever you got it online)? If you are running this on Terra or FireCloud, let me know, because you could just share the workspace.
Hi @Tiffany_at_Broad, I am attaching the log file and the WDL script, which is a very lightly modified (e.g. removed Docker runtime params) version of `processing-for-variant-discovery-gatk4.wdl` from https://github.com/gatk-workflows/. I had to change the file extension to `.wdl.txt`, as the file format was not allowed for upload. Currently I am only testing this workflow on a single node of an HPC cluster.

Hi again @Tiffany_at_Broad, I decided to check whether there was something wrong with the `MarkDuplicates` step on these BAMs by running GATK directly (i.e. outside of Cromwell execution) via a job submission. I used the BAM files generated during the previous attempt that transitioned to a failed state immediately after `MarkDuplicates` (above). This worked fine, with return code 0 and a valid BAM. Additionally, the `D06403.canfam3.duplicate_metrics` file was identical between the failed workflow run and the `MarkDuplicates`-only run. Does this suggest something is failing when Cromwell attempts to start the next task, `SortAndFixTasks`? I've attached the stdout.

Apologies for the continued posts! I just noticed something in the stdout: it appears `MarkDuplicates` actually completed successfully ("Tool returned: 0") prior to a Java error. I'm not sure how to interpret or address the error.

Hi @jonahcullen, can you share the log with the error in it? It would be good to see what is happening before the error.
Hi @Tiffany_at_Broad, I have attached the full log file (`processing-for-variant-discovery-gatk4.stdout.txt`); line 202 is where the error appears to begin. I am also attaching the stderr (`stderr.txt`), the stdout from `MarkDuplicates` (`MarkDuplicates.stdout.txt`), and the Java error report (`hs_err_pid1649.log`) referenced in the stdout.
Thank you again for your help! Please let me know if there is anything I can provide or do to help figure this out!
I am sorry it is taking us so long to get back to you.
Something I noticed in the log was:
```
Job PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1 exited with return code 250 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details
```
You could use `continueOnReturnCode` in the WDL. I am trying to find out more about what the 250 means. Stay tuned.
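A minimal sketch of that runtime attribute (250 is included only because it is the code being returned; whether it is safe to continue past it is exactly what's in question):
```wdl
runtime {
  # Accept both 0 and 250 as valid exit codes for this task
  continueOnReturnCode: [0, 250]
}
```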
Also, here is more info on MarkDuplicates memory requirements:
Q: MarkDuplicates takes a long time. Is something wrong?
A: Not necessarily. MarkDuplicates is very memory-intensive; the memory is required in order to detect interchromosomal duplication. At Broad, we run MarkDuplicates with a 2 GB Java heap (`java -Xmx2g`) and a 10 GB hard memory limit. Some example run times:
- An 8.6 GB file (63M records) took about 1 hour
- A 20 GB file (133M records) took about 2.5 hours
Increasing the amount of RAM allocated to MarkDuplicates (the `-Xmx` value) will improve the speed somewhat. Note that if you have a hard memory limit, it needs to be increased as well. One workaround for large datasets is to mark duplicates for each library independently and subsequently merge the marked files, as sketched below.
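For the per-library workaround, a minimal sketch of the command sequence (file and library names hypothetical), assuming the GATK4 launcher:
```wdl
command {
  # Mark duplicates within each library separately; duplicate detection
  # operates per library, so per-library marking followed by a merge is safe.
  gatk --java-options "-Xmx2g" MarkDuplicates \
    --INPUT libA.bam --OUTPUT libA.marked.bam --METRICS_FILE libA.duplicate_metrics
  gatk --java-options "-Xmx2g" MarkDuplicates \
    --INPUT libB.bam --OUTPUT libB.marked.bam --METRICS_FILE libB.duplicate_metrics

  # Merge the per-library marked BAMs into a single output
  gatk MergeSamFiles \
    --INPUT libA.marked.bam --INPUT libB.marked.bam \
    --OUTPUT sample.marked.bam
}
```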
Hi @Tiffany_at_Broad, no problem at all!
I had considered using the `continueOnReturnCode: 250` runtime option; I was just a little hesitant as I could not find any information on what 250 indicates... However, I just started it anyway, so I'll let you know how that shakes out.

Curious to see your results!

Unfortunately, it did not work... the next step in the workflow also had a return code of 250. I'm rerunning the workflow with `continueOnReturnCode: [0, 250]` for all tasks... we'll see what that does!

Are the results for MarkDuplicates the same as when you ran it locally? Are there any runtime limits imposed by the HPC cluster that you are coming up against? Can you share the configuration file you are using? This doc is about specifying runtime attributes for your HPC tasks (in case you haven't seen it). I'm hoping we can help you figure this out soon!
Hi @Tiffany_at_Broad, I did use that doc (and others) to put together the configuration file. I attempted to run the workflow again and specifically ran `MarkDuplicates` using the backend I defined in the configuration file. This resulted in the same error, where the actual tool returned 0 but a return code of 250 was reported to the workflow, thus ending the workflow. Has there been any luck in determining what the 250 indicates? This issue only occurs when running `MarkDuplicates` as part of a workflow, not when it is run as a standalone tool...

Hi @jonahcullen,

This Java error means: the Java VM dies with SIGBUS (0x7) when the temp directory is full on Linux.
So Java is crashing, and that is why it is returning the 250.
1.) You can try pointing to a bigger temp directory somewhere on a disk with ample space by adding the `--TMP_DIR` argument to MarkDuplicates.
2.) Try reducing your heap size (use `-Xmx` in addition to `-Xms`) relative to the memory assigned to your node.
Let me know if one of these options helps?
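A minimal sketch of option 1 inside the task's command block (the scratch path and file names are hypothetical):
```wdl
command {
  gatk --java-options "-Xms4000m -Xmx32g" MarkDuplicates \
    --INPUT sample.lane1.bam \
    --INPUT sample.lane2.bam \
    --OUTPUT sample.marked.bam \
    --METRICS_FILE sample.duplicate_metrics \
    --TMP_DIR /scratch/username/md_tmp  # spill files land here instead of the default tmp
}
```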
Hi @Tiffany_at_Broad, thank you very much for that information - that makes complete sense! I started a workflow with the `--TMP_DIR` argument and will report back when it finishes.

Could you provide input on how I should allocate the heap size? For example, when using `MarkDuplicates` to process two BAM files (each ~30 GB), it took ~11 hours with `-Xms4000m` and `-Xmx32g` on a node with 250 GB of memory. This attempt was prior to using `--TMP_DIR`. I don't mind the amount of time it takes per se, but I worry I'm going to have memory issues as I attempt to process more than two BAM files.

Thanks again for all your help - I really appreciate it!
The workflow that included `--TMP_DIR` resulted in the same return code of 250. However, there were a couple of differences now in the `stderr` and `stdout` (both attached). The `stdout` no longer contained the "# A fatal error has been detected by the Java Runtime Environment:" block of text, just "Tool returned: 0". Additionally, the `stderr` now contains a different error I have not seen from `MarkDuplicates` before:

Hi @jonahcullen - I will follow up with the GATK developers on this next week.
FYI, I looked at our preprocessing pipeline run in Terra, which uses individual GCP virtual machines per task (so MarkDuplicates has its own). We ran a 64 GB BAM through the pipeline, requested 7 GB of memory, set the heap start (Xms) to 4 GB, and that ran just fine. In general, you can set the Xmx equal to the Xms or a bit larger, but not too close to the total memory you requested.
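To make that concrete, a minimal WDL excerpt with those numbers (task shape and file names hypothetical):
```wdl
task MarkDuplicates {
  File input_bam

  command {
    # Heap: start (-Xms) at 4 GB, cap (-Xmx) equal to or slightly above -Xms,
    # and well below the 7 GB requested from the backend overall.
    gatk --java-options "-Xms4g -Xmx4g" MarkDuplicates \
      --INPUT ${input_bam} \
      --OUTPUT marked.bam \
      --METRICS_FILE duplicate_metrics
  }

  output {
    File output_bam = "marked.bam"
  }

  runtime {
    memory: "7 GB"
  }
}
```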
Thank you! I will try `-Xms4g` and `-Xmx16g` and see how that runs.

Hi @jonahcullen, you didn't attach the logs in your last message. What is the latest?
Hi @Tiffany_at_Broad, sorry about that.
I'm attaching the `stderr` (`stderr.TMP_DIR.txt`) from running `MarkDuplicates` as part of the processing workflow with `-Xms4000m` and `-Xmx32g`. This is the run I mentioned where the tool returned 0 and the workflow return code was 250 (as per usual, it seems!), but the `stdout` (`stdout.TMP_DIR.txt`) did not include the "# A fatal error has been detected by the Java Runtime Environment:" block. This was also the run where the `stderr` contained

I'm also attaching the `stderr` (`stderr.MARK_DUP.txt`) and `stdout` (`stdout.MARK_DUP.txt`) from running `MarkDuplicates` as a standalone tool with the same BAM files used in the processing above. This was run with `--TMP_DIR` in a location with TBs of disk space, on a 62 GB node, with `-Xms4g` and `-Xmx10g` - I meant to use `-Xmx16g` but must have mistyped. This run ALSO resulted in the "# A fatal error has been detected by the Java Runtime Environment:" error; it had a return code of 250, but the tool returned 0.

Thank you again so much for your continued help with this!
Are you running inside Docker on the node?
Not as of yet, just "locally" (or by spawning jobs per the backend configuration). Eventually I will set it up to run using Singularity, but I have not attempted that aspect yet.
Hmm, we are a little bit stumped. I am following up with a few folks to see if we can get any more ideas.
No problem @Tiffany_at_Broad, totally understand! I have some additional information...
I thought I'd try to run the `HaplotypeCaller` workflow using the BAM generated from the workflow we've been discussing with the `MarkDuplicates`/JRE issue. I did this by submitting each of ten intervals to the job scheduler (PBS) of my HPC. One of the ten failed with a familiar-looking error in the `stderr` and `stdout` (both attached)...
Hi @jonahcullen,
Our devs think you may be hitting a weird bug (some sort of C++ bug). Can you try using a newer version of Java? If that doesn't help, can you try disabling the Intel inflater/deflater by using these arguments while running MarkDuplicates:
`--USE_JDK_DEFLATER`
`--USE_JDK_INFLATER`
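For reference, a minimal sketch of that invocation (file names hypothetical); note that in GATK4 these Picard-style boolean arguments take a true/false value:
```wdl
command {
  gatk --java-options "-Xms4g -Xmx16g" MarkDuplicates \
    --INPUT sample.lane1.bam \
    --INPUT sample.lane2.bam \
    --OUTPUT sample.marked.bam \
    --METRICS_FILE sample.duplicate_metrics \
    --USE_JDK_DEFLATER true \
    --USE_JDK_INFLATER true  # bypass the Intel (GKL) inflater/deflater
}
```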
Hi @Tiffany_at_Broad, thank you very much for that information - I will absolutely try that. I'm not sure if the following will help support that theory, but if I take a BAM that appears okay, except that its `MarkDuplicates` run had the return code of 250, and push it forward, I get the following `stdout` from `BaseRecalibrator`, yet all the return codes are 0:

Hi @jonahcullen, the return value of the tool is the number of reads processed. If this is the output from multiple runs on different parts of the same file, then it is expected. If it is output from the same parts of the same file, that wouldn't be so good, because it would mean getting different results from identical runs.
Hi @Tiffany_at_Broad, I tried using a newer version of Java as well as the `--USE_JDK_DEFLATER` and `--USE_JDK_INFLATER` arguments, but unfortunately got the same 250 return code. It seems strange that it is just this step that is getting messed up. Also, even with the error, it seems to take a really long time - multiple days in some cases.

Hi @jonahcullen,
Can you recreate the issue and provide the hs log file (the file path beneath the core dump one) and the exact commands you're running (or just confirm it is the same WDL if already attached), along with any other error logs? Are you open to trying this out on Terra?
Hi @Tiffany_at_Broad, I want to express my sincere gratitude for all the help you've provided in attempting to solve this little issue.
I am attaching the following:
1. the complete (i.e. fastqs to GVCF) `wdl` file
2. the `json`
3. the `cromwell.conf`
4. the `stderr` and `stdout` from `MarkDuplicates` (and `stdout.background` and `stderr.background` from running via job submission to an HPC)
5. the duplicate metrics

I started the workflow with

I am using `gatk-4.1.1.0`. This particular iteration ran from fastqs to GVCF, albeit with a return code of 250 from the `MarkDuplicates` step and without producing an hs log file (I could provide a different example with that output if needed). There was, however, a 16 GB `core.25133` dump.

The java version was

Although not for this particular sample, I have attempted to run this complete workflow using

as well as with the `--USE_JDK_DEFLATER` and `--USE_JDK_INFLATER` options; both attempts resulted in the 250 return code and core dump.

Hi @jonahcullen - I hope to dig into this more tomorrow. Have you considered using MarkDuplicatesSpark instead of MarkDuplicates?
Hi @Tiffany_at_Broad, I have not tried `MarkDuplicatesSpark`, but I am willing to try!

Hi @jonahcullen, sorry for not getting back to you earlier. I hope you were able to make progress, either troubleshooting this or running with MarkDuplicatesSpark. I checked out your Cromwell config file and all looked good. You could check the RC.txt file in the execution directory to see how Cromwell worked out what the return code for the task was. We still think MarkDuplicates is having some sort of C++ issue on your HPC, related to however the node itself is configured. You could try using Docker with this pipeline or replacing the tool with MarkDuplicatesSpark. Here's a useful doc. I hope this helps!
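For anyone following along, a minimal sketch of swapping in MarkDuplicatesSpark (task shape and file names hypothetical; the Spark tool runs locally by default and emits a coordinate-sorted BAM):
```wdl
task MarkDuplicatesSpark {
  File input_bam  # expects queryname-sorted (or query-grouped) input

  command {
    gatk MarkDuplicatesSpark \
      -I ${input_bam} \
      -O marked.sorted.bam \
      -M duplicate_metrics \
      --spark-master 'local[*]'  # use all cores on the local node
  }

  output {
    File output_bam = "marked.sorted.bam"
    File metrics = "duplicate_metrics"
  }

  runtime {
    memory: "32 GB"
  }
}
```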
Hi @Tiffany_at_Broad, thank you for checking in! It's the weirdest thing, but for the past few weeks `MarkDuplicates` appears to be working - at least based on seeing 0 return codes. I started running samples in a new directory in scratch. It is still taking an excessive amount of time - anywhere from 2 days up to 10 days for some samples... I may try `MarkDuplicatesSpark` and see if that is a viable option for us.

Awesome to hear! Good luck!