
The order of merging and marking duplicates

I have a whole-genome sequencing sample.
It consists of one FASTQ file per lane,
so there are multiple files per sample, each produced from one lane.

After merging the per-lane BAMs, I run MarkDuplicates using Picard.
But MarkDuplicates is very slow.

So I want to run MarkDuplicates on the BAM for each lane, and then merge the BAM files.

Does the order of merging and MarkDuplicates affect downstream analysis?
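
To make the two orderings concrete, here is a minimal sketch that only builds the Picard command lines for each workflow and does not run anything. The jar location, the file names, and the `.md.bam` naming scheme are assumptions for illustration; the `I=`, `O=`, and `M=` arguments are Picard's legacy-style options for MergeSamFiles and MarkDuplicates.

```python
from typing import List

# Assumed way of invoking Picard; adjust to your installation.
PICARD = ["java", "-jar", "picard.jar"]


def _marked_name(bam: str) -> str:
    """Hypothetical naming scheme: lane1.bam -> lane1.md.bam."""
    stem = bam[:-4] if bam.endswith(".bam") else bam
    return stem + ".md.bam"


def merge_then_mark(lane_bams: List[str], sample: str) -> List[List[str]]:
    """Ordering 1: merge all per-lane BAMs, then mark duplicates once."""
    merged = f"{sample}.merged.bam"
    merge = PICARD + ["MergeSamFiles"] + [f"I={b}" for b in lane_bams] + [f"O={merged}"]
    mark = PICARD + [
        "MarkDuplicates",
        f"I={merged}",
        f"O={_marked_name(merged)}",
        f"M={sample}.metrics.txt",
    ]
    return [merge, mark]


def mark_then_merge(lane_bams: List[str], sample: str) -> List[List[str]]:
    """Ordering 2: mark duplicates per lane, then merge the marked BAMs."""
    commands = []
    marked = []
    for bam in lane_bams:
        out = _marked_name(bam)
        commands.append(
            PICARD + ["MarkDuplicates", f"I={bam}", f"O={out}", f"M={out}.metrics.txt"]
        )
        marked.append(out)
    merge = PICARD + ["MergeSamFiles"] + [f"I={b}" for b in marked] + [f"O={sample}.merged.bam"]
    commands.append(merge)
    return commands


if __name__ == "__main__":
    lanes = ["lane1.bam", "lane2.bam"]  # hypothetical per-lane BAMs
    for cmd in merge_then_mark(lanes, "sampleA") + mark_then_merge(lanes, "sampleA"):
        print(" ".join(cmd))
```

In the second ordering each MarkDuplicates run sees only one lane, so it has no way to compare reads across lanes. That is the main reason the two orderings are not expected to give identical duplicate counts.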

Best Answer


  • Sunhye (Korea, Member)

    Thanks Sheila!

  • falker (Germany, Member)

    I am already done with my GATK Best Practices analysis and just realized that I ran MarkDuplicates separately on each read file from each lane belonging to a sample.

    I'm afraid this is not the right way. Do I have to at least run MarkDuplicates per lane, or can I trust my variant calling with it done that way?

  • falker (Germany, Member)
    edited February 27

    I can answer that question myself, for anybody who is interested in the impact of merging when running MarkDuplicates:

    Dataset ~20x coverage, 84 single u.bam files

    MarkDuplicates joint run (no. of duplicates): 8610548
    MarkDuplicates single runs (files merged after MarkDuplicates): 547105

    So in that case, roughly 16x more duplicates were found when running all files at once.
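
The ratio between the two counts quoted above can be checked with a couple of lines. This is only arithmetic on the numbers as posted; the variable names are illustrative.

```python
# Duplicate counts as posted above.
joint_run_duplicates = 8610548   # MarkDuplicates run once over all files
single_run_duplicates = 547105   # MarkDuplicates run per file, merged afterwards

ratio = joint_run_duplicates / single_run_duplicates
extra = joint_run_duplicates - single_run_duplicates

print(f"joint / single = {ratio:.1f}x")
print(f"additional duplicates found by the joint run: {extra}")
```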

  • Sheila (Broad Institute; Member, Broadie, Moderator)
    edited March 1


    Thanks for sharing. Perhaps this thread and this article will help as well.


    EDIT: This one too :smile:
