
What is MarkDuplicates doing after it reports "MarkDuplicates done"?

I had a transient file system error occur while several MarkDuplicates jobs were running, and I'm trying to figure out if there is any way to recover without rerunning the jobs from scratch.

Specifically, what I see in the output for each job is that it reports
...
INFO 2016-02-29 00:02:13 MarkDuplicates Written 670,000,000 records. Elapsed time: 04:52:49s. Time for last 10,000,000: 251s. Last read position: chr10:85,021,664
[Mon Feb 29 00:12:24 EST 2016] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 862.99 minutes.
... a couple irrelevant lines, then
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: Read error; BinaryCodec in readmode; file: .../alignments/SRR371622.07.bam
...a bunch of traceback lines

SRR371622.07.bam is one of the input files, and does exist in that directory. I've verified with the sysadmin that system logs show some sort of file system error occurred at around that time.

My command is like this:
inputArgs=" ... INPUT=alignments/SRR371622.07.bam ... "
maxFiles=$(( 80 * $(ulimit -n) / 100 ))

.../java -jar .../picard.jar MarkDuplicates \
TMP_DIR=$(pwd)/temp_dmarked \
${inputArgs} \
OUTPUT=alignments/XXX.dmarked.bam \
METRICS_FILE=alignments/XXX.dmarked.metrics.txt \
ASSUME_SORTED=true \
REMOVE_DUPLICATES=true \
READ_NAME_REGEX=null \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=${maxFiles}

Each job has left behind the intended output file XXX.dmarked.bam, and some jobs appear to have left behind a directory full of temporary files in temp_dmarked.

That the last read position is reported as chr10:85,021,664 is worrisome; my reference is chimp.

So, my questions:
(1) Is there any hope that the created bam files are useful? Is there any easy way to verify what's in them?
(2) Failing that, is the information in the temp files useful?
(3) If Picard encountered a problem reading files before it reported "MarkDuplicates done", would it report the error? My concern is that there's only a ten-minute gap between chr10:85,021,664 and "MarkDuplicates done", which, unless chr10 is the last chromosome it's processing, seems suspiciously short.
(4) Am I correct that MarkDuplicates doesn't support operation on intervals? I'd like to be able to break this up into many subintervals so that when something fails I don't lose as much. But as near as I can tell MarkDuplicates doesn't support doing this.

Thanks for any help or suggestions.
Bob H

Answers

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hi Bob,

    picard.sam.markduplicates.MarkDuplicates done just means that the core algorithm is done processing the data, but then there's still a bit of work to be done writing the output to file. It looks like something's going wrong somewhere during that operation. Possibly a file system glitch.

    To your specific questions:

    1. You can run ValidateSamFile to check the file's integrity; my bet is that it's going to be truncated.
    2. Unfortunately the temp files aren't salvageable, in the sense that you can't just run again and have the tool pick up where it left off.
    3. It would report any error that prevented it from reading in data as far as it's aware. There are problems that can happen without the program's awareness.
    4. Correct. As a workaround you could generate per-chromosome bam files and run MarkDups on those independently, then combine them again afterward.
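    A minimal integrity check for one of the output files might look like this (a sketch; the picard.jar path and file name are taken from the thread and should be adjusted):

    ```shell
    # ValidateSamFile reports truncation (e.g. a missing BGZF EOF block)
    # as well as formatting problems; MODE=SUMMARY collapses errors by type.
    java -jar picard.jar ValidateSamFile \
        INPUT=alignments/XXX.dmarked.bam \
        MODE=SUMMARY
    ```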
  • BobHarris (earth) Member

    Thanks, the answers for 1-3 are pretty much what I feared.

    Per the answer to 4, what's the easiest way to generate per-chromosome bam files from 50..100 whole-genome bam files? In fact, I only need one of the chromosomes downstream.

    I had explored eliminating the other chromosomes after mapping (I need to keep them during mapping), but at the time it looked like I was getting in deeper than I wanted.

    Sounds like there must be some tool that does the by-chromosome filtering?

  • BobHarris (earth) Member

    > @BobHarris said:

    ... what's the easiest way to generate per-chromosome bam files from 50..100 whole genome bam files? In fact, I actually only need one of the chromosomes downstream.

    I've tried a couple of suggestions I found online, but as best I can tell they both had shortcomings for my intended downstream analysis, which is variant calling on one particular chromosome.

    One method was to use samtools view with each chromosome as an interval. Since I only want one chromosome, this seems ideal. However, it has a few shortcomings. First, it keeps all the other sequences in my header. So the variant caller (perhaps?) will think this means there was no coverage there, which (perhaps?) will have an adverse effect on any statistical metrics produced.

    Second, it doesn't keep the unmapped mates. I.e. one mate in a pair maps to chrW, and the other mate maps nowhere. My expectation is that the variant caller will want to know this, and potentially will care about the nts and quality scores for the unmapped mate. Or maybe it doesn't, but how can I know?

    Third, mates that map to different chromosomes seem to be handled the same as unmapped mates.
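    The region-query method above can be sketched as follows (chrW and the file names are placeholders; samtools region queries require a .bai index):

    ```shell
    # Extract one chromosome. Note the caveats from the post: the output
    # header still lists every @SQ sequence, and mates that are unmapped
    # (or mapped to another chromosome) are not returned by the query.
    samtools index alignments/sample.bam
    samtools view -b alignments/sample.bam chrW > alignments/sample.chrW.bam
    samtools view -H alignments/sample.chrW.bam   # header keeps all sequences
    ```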

    The other method was bamtools split. This had all the same problems as the samtools method, but added the problem of extra files being created, one for every sequence in the file. For my use it has no advantage over the samtools method. And from past experience I've learned *nix can have difficulties with directories containing too many files.

    Is there discussion somewhere that would clue me in on how I would need to do this? Ideally I'd like to (a) reduce computation to a single chromosome as early in the pipeline as possible, and (b) get downstream results (after variant calling), for that chromosome, identical to what I would have gotten had I processed the entire genome at every step.

    Am I making sense?

    Thanks,
    Bob H

  • Sheila (Broad Institute) Member, Broadie admin

    @BobHarris
    Hi Bob H,

    You can use PrintReads to get per-chromosome bam files.

    If you are only interested in one chromosome, you have two options.

    1) If you are worried about compute resources and time, you can simply run the Best Practices tools on the one-chromosome bam file you are interested in. You will still have to use the reference with all the chromosomes in it, but you can use -L to specify which chromosome to run on. You are correct that you cannot use VQSR, however, because there will not be enough data. You can use hard filtering as a replacement for VQSR.

    2) If you have no concerns about compute and time, you can run the Best Practice tools on the entire genome all the way through VQSR. After VQSR, you can use SelectVariants to select the variants from your chromosome of interest.
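    The second option might be sketched with GATK 3-era syntax (the jar path, reference, and file names are placeholders, not verified paths):

    ```shell
    # After VQSR, keep only the records on the chromosome of interest.
    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R chimp.fasta \
        -V recalibrated.vcf \
        -L chr10 \
        -o chr10.vcf
    ```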

    I hope this helps.

    -Sheila

  • BobHarris (earth) Member

    Thanks, Sheila. To clarify, to "run the Best Practices tools on the one chromosome bam file you are interested in", what is the best way to create the "one chromosome bam file"? In my previous post, I described two ways I tried to do this, and what my concerns are about whether downstream analysis (using best practices for variant calls) will be correct.

    Note that I'm currently stuck at MarkDuplicates, because (a) there appears to be no way to tell it to use only a single chromosome, (b) when it operates on the whole genome, the long runtime increases the chance that a transient disk/network error will stop the job from finishing, and (c) MarkDuplicates has to start over from scratch every time; it can't recover from a failed job. So I'm trying to work around some flaky hardware while still getting the right results at the end.

    Bob H

  • Sheila (Broad Institute) Member, Broadie admin

    @BobHarris
    Hi Bob H,

    Yes, you can use PrintReads with -L "your chromosome of interest" to get the one chromosome only bam file.
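    For example (a sketch with GATK 3-era syntax; the reference and file names are placeholders drawn from this thread):

    ```shell
    # PrintReads honors the engine-level -L argument, so this writes
    # only reads overlapping chr10 to the new bam.
    java -jar GenomeAnalysisTK.jar \
        -T PrintReads \
        -R chimp.fasta \
        -I alignments/SRR371622.07.bam \
        -L chr10 \
        -o alignments/SRR371622.07.chr10.bam
    ```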

    -Sheila

  • BobHarris (earth) Member

    Thanks, @Sheila . A quick test shows me that PrintReads -L will give me the same results as the other two methods.

    I'm concerned about whether the fact that these methods delete unmapped mates (and mates mapped to other chromosomes) will have any harmful effect downstream when I use the UG variant caller (see my Mar 1 post). If I can get assurance that won't be an issue, I'll be good to go.

    Bob H

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    @BobHarris The variant caller won't care about the unmapped mates; they will be ignored. So that shouldn't be an obstacle. By the way, you should consider using HaplotypeCaller rather than UnifiedGenotyper. The latter is considered deprecated at this point; it yields markedly inferior results on indels.

    Regarding the sequence dictionary in the header, that is also not a problem. When you run the analysis, just specify the -L intervals the same way as when running PrintReads, and the caller will know to process only the interval where you have data. It would work fine even without doing that, but by doing it you'll get an additional speedup.
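    Combining both points, a call restricted to the chromosome of interest might look like this (a sketch with GATK 3-era syntax; chrW and the file names are placeholders):

    ```shell
    # The full reference and its sequence dictionary stay in place;
    # -L limits traversal to the one chromosome where there is data.
    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R chimp.fasta \
        -I alignments/sample.chrW.dmarked.bam \
        -L chrW \
        -o sample.chrW.vcf
    ```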

  • BobHarris (earth) Member

    Thanks @Geraldine_VdAuwera, that allays my concerns.

    Regarding HC vs UG, I have considered it (and still am). The first stuff I'm running needs to use UG because I will be comparing my results to those from a paper which used UG on similar data -- a sanity check to make sure I've not screwed up somewhere in the pipeline.

    I'm pretty sure that in one of the best practices tutorials, or other GATK documentation, it said something to the effect of "HC should give better results than UG, but we (Broad) are not using HC for our in-house pipelines yet." I took that as a deterrent from using HC. I don't have the URL for that; perhaps it was stale information?

    Thanks for your help,
    Bob H

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Oy, that's extremely stale. We've been using HC in production for two years now. If you can find the URL we'll fix that ASAP.

  • BobHarris (earth) Member

    I've looked through my notes and can't find it. I've been through a large number of your tutorials and Best Practices videos over the past few weeks, and unfortunately I don't recall where I read that. Hopefully I'm not just imagining it!

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    No you didn't imagine it, I remember writing that. But it's definitely no longer applicable. That article was probably archived.

    In the near future we're going to add some explicit versioning to key documents and publicize version updates to the Best Practices more widely to avoid this kind of confusion.
