Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
What is MarkDuplicates doing after it reports "MarkDuplicates done"?
I had a transient file system error that occurred while several MarkDuplicates jobs were running, and I'm trying to figure out if there is anyway to recover without rerunning the jobs from scratch.
Specifically, what I see in the output for each job is that it reports
INFO 2016-02-29 00:02:13 MarkDuplicates Written 670,000,000 records. Elapsed time: 04:52:49s. Time for last 10,000,000: 251s. Last read position: chr10:85,021,664
[Mon Feb 29 00:12:24 EST 2016] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 862.99 minutes.
... a couple irrelevant lines, then
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: Read error; BinaryCodec in readmode; file: .../alignments/SRR371622.07.bam
...a bunch of traceback lines
SRR371622.07.bam is one of the input files, and does exist in that directory. I've verified with the sysadmin that system logs show some sort of file system error occurred at around that time.
My command is like this
inputArgs=" ... INPUT=alignments/SRR371622.07.bam ... "
.../java -jar .../picard.jar MarkDuplicates \
Each jobs has left behind the intended output file XXX.dmarked.bam, and some jobs appear to have left behind a directory fill of temporary files in temp_dmarked.
That the last read position is reported as chr10:85,021,664 is worrisome; my reference is chimp.
So, my questions:
(1) Is there any hope that the created bam files are useful? Is there any easy way to verify what's in them?
(2) Failing that, is the information in the temp files useful?
(3) If Picard encountered a problem reading files before it reported "MarkDuplicates done", would it report the error? My concern is that there's only a ten minute lapse between chr10:85,021,664 and "MarkDuplicates done", which, unless chr10 is the last chromosome it's processing, seems suspiciously short.
(4) Am I correct that MarkDuplicates doesn't support operation on intervals? I'd like to be able to break this up into many subintervals so that when something fails I don't lose as much. But as near as I can tell MarkDuplicates doesn't support doing this.
Thanks for any help or suggestions.