
The order of merging and marking duplicates

Sunhye, Korea, Member

I have a whole-genome sequencing sample.
It consists of one FASTQ file per lane,
so each sample is spread across multiple files, one produced per lane.

After I merge the per-lane BAMs, I run MarkDuplicates using Picard.
But MarkDuplicates is very slow.

So I want to run MarkDuplicates on each per-lane BAM first, then merge the BAM files.

Does the order of merging and MarkDuplicates affect downstream analysis?
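
One option worth noting here: Picard MarkDuplicates accepts more than one input file, so the per-lane BAMs can be passed to a single run that marks duplicates across all lanes and writes one combined output. Below is a minimal sketch that only builds such a command line in Python. The jar location and all file names are hypothetical, the legacy `I=`/`O=`/`M=` argument style is assumed, and the inputs are assumed to be sorted as the tool requires.

```python
def mark_duplicates_command(lane_bams, output_bam, metrics_file, picard_jar="picard.jar"):
    """Build one Picard MarkDuplicates command that reads every per-lane BAM at once."""
    if not lane_bams:
        raise ValueError("at least one input BAM is required")
    # picard_jar is a hypothetical path; adjust it to the local installation.
    cmd = ["java", "-jar", str(picard_jar), "MarkDuplicates"]
    # One I= argument per lane, so duplicates are detected across lanes.
    for bam in lane_bams:
        cmd.append(f"I={bam}")
    # O= is the combined, duplicate-marked BAM; M= is the metrics file.
    cmd += [f"O={output_bam}", f"M={metrics_file}"]
    return cmd


# Hypothetical per-lane files for one sample.
lanes = ["sample_L001.bam", "sample_L002.bam", "sample_L003.bam"]
command = mark_duplicates_command(lanes, "sample.markdup.bam", "sample.markdup_metrics.txt")
print(" ".join(command))
```

The list can be handed to `subprocess.run(command, check=True)` once the paths point at real files. This does not make the run faster, but it avoids a separate merge step before marking.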

Best Answer


  • Sunhye, Korea, Member

    Thanks Sheila!

  • falker, Germany, Member

    I have already finished my GATK Best Practices analysis and just realized that I ran MarkDuplicates separately on each read file from each lane belonging to a sample.

    I'm afraid this is not the right way. Do I have to at least run MarkDuplicates per lane, or can I trust my variant calling as it was done?

  • falker, Germany, Member
    edited February 2018

    I can answer that question myself, for anybody who is interested in the impact of merging files before versus after running MarkDuplicates:

    Dataset: ~20x coverage, 84 single u.bam files

    MarkDuplicates joint run (no. of duplicates): 8610548
    MarkDuplicates single runs (files merged after MarkDuplicates): 547105

    So in that case, about 15.7 times as many duplicates were found when running all files at once.

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin
    edited March 2018


    Thanks for sharing. Perhaps this thread and this article will help as well.


    EDIT: This one too :smile:
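
As a quick check on the duplicate counts falker reports for the ~20x dataset, the ratio between the joint run and the per-file runs can be computed directly. This is a minimal sketch; the variable names are illustrative, and the two counts are copied from the post.

```python
# Duplicate counts copied from falker's post (84 single u.bam files, ~20x coverage).
joint_run_duplicates = 8610548   # all files passed to one MarkDuplicates run
per_file_duplicates = 547105     # MarkDuplicates per file, files merged afterwards

# How many times more duplicates the joint run finds.
ratio = joint_run_duplicates / per_file_duplicates
# Duplicates that the per-file runs did not flag.
missed = joint_run_duplicates - per_file_duplicates

print(f"joint / per-file ratio: {ratio:.1f}x")
print(f"duplicates not flagged by per-file runs: {missed}")
```

A gap of this size is consistent with how duplicate marking works: reads that come from the same library fragment can end up in different files, and a run that sees only one file at a time cannot compare them.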
