The order of merging and marking duplicates

I have a whole-genome sequencing sample.
It consists of one FASTQ file per lane,
so there are multiple files per sample, one produced per lane.

After I merge the per-lane BAMs, I run MarkDuplicates using Picard.
But MarkDuplicates is very slow.

So I want to run MarkDuplicates on each per-lane BAM first, then merge the BAM files.

Does the order of merging and MarkDuplicates affect downstream analysis?

Best Answer

Answers

  • Sunhye (Korea, Member)

    Thanks Sheila!

  • falker (Germany, Member)

    I have already finished my GATK Best Practices analysis and just realized that I ran MarkDuplicates separately on each file of reads from each lane belonging to a sample.

    I'm afraid this is not the right way. Do I have to at least run MarkDuplicates per lane, or can I trust my variant calls as they are?

  • falker (Germany, Member)
    edited February 27

    I can answer that question myself, for anybody who is interested in the impact of merging before running MarkDuplicates:

    Dataset: ~20x coverage, 84 single u.bam files

    MarkDuplicates joint run (no. of duplicates): 8610548
    MarkDuplicates single runs (files merged after MarkDuplicates): 547105

    So in that case, roughly 16x as many duplicates were found when running all files at once (8610548 / 547105 ≈ 15.7).
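
A minimal sketch of why a joint run can find more duplicates than separate runs. It is a toy model, not Picard's actual algorithm: it assumes a duplicate is any read sharing the same (library, chromosome, position, strand) key as an earlier read, and all read values and function names below are made up for illustration. Reads from the same library that land in different per-lane files can only be compared when they are processed together.

```python
from collections import defaultdict

# Toy reads: (lane, library, chrom, pos, strand). All values are hypothetical.
reads = [
    ("lane1", "libA", "chr1", 100, "+"),
    ("lane1", "libA", "chr1", 100, "+"),  # duplicate within lane1
    ("lane2", "libA", "chr1", 100, "+"),  # duplicate of the lane1 reads, but in another file
    ("lane2", "libA", "chr1", 250, "-"),
    ("lane3", "libA", "chr1", 250, "-"),  # duplicate of the lane2 read, again across files
    ("lane3", "libA", "chr2", 500, "+"),
]


def count_duplicates(batch):
    """Count reads beyond the first for each (library, chrom, pos, strand) key."""
    seen = defaultdict(int)
    for _lane, library, chrom, pos, strand in batch:
        seen[(library, chrom, pos, strand)] += 1
    return sum(n - 1 for n in seen.values())


def per_lane_duplicates(all_reads):
    """Count duplicates lane by lane, as if each lane were marked separately."""
    by_lane = defaultdict(list)
    for read in all_reads:
        by_lane[read[0]].append(read)
    return sum(count_duplicates(batch) for batch in by_lane.values())


joint = count_duplicates(reads)
split = per_lane_duplicates(reads)
print(f"joint run: {joint} duplicates, per-lane runs: {split} duplicates")
```

In this toy data the joint pass sees the cross-lane duplicates that the per-lane passes miss. The real tool applies more detailed criteria, so this only illustrates the direction of the effect, not its size.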

  • Sheila (Broad Institute; Member, Broadie, Moderator)
    edited March 1

    @falker
    Hi,

    Thanks for sharing. Perhaps this thread and this article will help as well.

    -Sheila

    EDIT: This one too :smile:
