The order of merging and marking duplicates

SunhyeSunhye KoreaMember

I have a whole-genome sequencing sample.
It has one FASTQ file per lane,
so the sample consists of multiple files, one produced per lane.

After merging the per-lane BAMs, I run MarkDuplicates using Picard.
But MarkDuplicates is very slow.

So I want to run MarkDuplicates on each per-lane BAM first, then merge the BAM files.

Does the order of merging and MarkDuplicates affect downstream analysis?
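
For reference, Picard's MarkDuplicates accepts more than one INPUT, so the per-lane BAMs can be passed to a single run without a separate merge step. Below is a minimal sketch that only builds such a command line; the file names, the `picard.jar` path, and the helper name `build_markduplicates_command` are hypothetical, and the legacy `I=`/`O=`/`M=` argument style is assumed (newer Picard releases also accept `-I`/`-O`/`-M`).

```python
from typing import List


def build_markduplicates_command(
    lane_bams: List[str],
    output_bam: str,
    metrics_file: str,
    picard_jar: str = "picard.jar",  # hypothetical path to the Picard jar
) -> List[str]:
    """Build a Picard MarkDuplicates command that takes all per-lane BAMs at once."""
    if not lane_bams:
        raise ValueError("at least one input BAM is required")
    cmd = ["java", "-jar", picard_jar, "MarkDuplicates"]
    # INPUT may be repeated: one I= argument per coordinate-sorted per-lane BAM.
    cmd.extend(f"I={bam}" for bam in lane_bams)
    cmd.append(f"O={output_bam}")    # single output BAM with duplicates marked
    cmd.append(f"M={metrics_file}")  # duplication metrics file
    return cmd


# Hypothetical file names, for illustration only.
example = build_markduplicates_command(
    ["sample.lane1.bam", "sample.lane2.bam"],
    "sample.markdup.bam",
    "sample.markdup_metrics.txt",
)
print(" ".join(example))
```

The resulting list can be handed to `subprocess.run` unchanged. Whether this is faster than merging first depends on the data and the machine, so it is worth timing on a small subset.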

Best Answer


  • SunhyeSunhye KoreaMember

    Thanks Sheila!

  • falkerfalker GermanyMember

    I have already finished my GATK Best Practices analysis and just realized that I ran MarkDuplicates separately on each read file from each lane belonging to a sample.

    I'm afraid this is not the right way. Should I at least have run MarkDuplicates per lane, or can I trust my variant calling having done it that way?

  • falkerfalker GermanyMember
    edited February 2018

    I can answer that question myself, for anybody who is interested in the impact of merging when running MarkDuplicates:

    Dataset: ~20x coverage, 84 single u.bam files

    MarkDuplicates joint run (no. of duplicates): 8610548
    MarkDuplicates single runs (files merged after MarkDuplicates): 547105

    So in that case, about 16x as many duplicates were found when running all files at once (8610548 / 547105 ≈ 15.7).
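
A gap like this is what one would expect if reads from the same library are spread over several files: a per-file run can only compare reads within that file, so duplicates that fall in different files go unmarked. The toy sketch below illustrates the idea under strong simplifying assumptions: the reads are made up, and a duplicate is defined only by library, chromosome, position, and strand, whereas the real MarkDuplicates also considers unclipped 5' positions and mate information.

```python
from collections import Counter


def count_duplicates(reads):
    """Count reads beyond the first at each (library, chrom, pos, strand) key."""
    keys = Counter((r["library"], r["chrom"], r["pos"], r["strand"]) for r in reads)
    return sum(n - 1 for n in keys.values())


# Made-up reads from one library sequenced on three lanes.
reads = [
    {"lane": 1, "library": "libA", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"lane": 2, "library": "libA", "chrom": "chr1", "pos": 100, "strand": "+"},  # cross-lane duplicate
    {"lane": 1, "library": "libA", "chrom": "chr1", "pos": 250, "strand": "-"},
    {"lane": 1, "library": "libA", "chrom": "chr1", "pos": 250, "strand": "-"},  # within-lane duplicate
    {"lane": 2, "library": "libA", "chrom": "chr2", "pos": 500, "strand": "+"},
    {"lane": 3, "library": "libA", "chrom": "chr2", "pos": 500, "strand": "+"},  # cross-lane duplicate
]

# Joint run: all reads are compared with each other.
joint = count_duplicates(reads)

# Per-lane runs: reads are only compared within their own lane, then totals are summed.
lanes = sorted({r["lane"] for r in reads})
per_lane = sum(count_duplicates([r for r in reads if r["lane"] == lane]) for lane in lanes)

print(f"joint run: {joint} duplicates, per-lane runs: {per_lane} duplicates")
```

In this toy data the joint count is 3 and the summed per-lane count is 1, because two of the three duplicate pairs span different lanes.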

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited March 2018


    Thanks for sharing. Perhaps this thread and this article will help as well.


    EDIT: This one too :smile:
