MarkDuplicates vs. MarkDuplicatesWithMateCigar: what is the best practice?

Hello,

I am curious to know which tool (MarkDuplicates or MarkDuplicatesWithMateCigar) people would advise for marking duplicates (I am following GATK's Best Practices for paired-end DNA reads). I understand that MarkDuplicatesWithMateCigar also uses CIGAR information to mark them, but I struggle to see what value this adds over the "regular" MarkDuplicates.

Also, selecting representative nondups based on the total mapped length of a pair (i.e., the MarkDuplicatesWithMateCigar method) rather than on the sum of base qualities of a pair (i.e., the MarkDuplicates method) seems more intuitive to me. Why would one prefer the latter over the former?
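
To make the two selection criteria concrete, here is a minimal Python sketch of how a representative pair might be chosen from one set of duplicates under each criterion. It is an illustration only, not Picard's implementation: `ReadPair`, the function names, and the example values are hypothetical, and the `min_quality` cutoff of 15 is an assumption about how low-quality bases are excluded from the sum.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReadPair:
    """Hypothetical, simplified stand-in for one paired-end fragment."""
    name: str
    base_qualities: List[int]      # Phred qualities of both reads, concatenated
    mapped_reference_length: int   # reference bases covered by both reads


def score_sum_of_base_qualities(pair: ReadPair, min_quality: int = 15) -> int:
    """Sum the base qualities at or above min_quality (the cutoff is an assumption)."""
    return sum(q for q in pair.base_qualities if q >= min_quality)


def score_total_mapped_reference_length(pair: ReadPair) -> int:
    """Score a pair by how many reference bases its two reads cover."""
    return pair.mapped_reference_length


def pick_representative(duplicate_set: List[ReadPair],
                        score: Callable[[ReadPair], int]) -> ReadPair:
    """Return the highest-scoring pair; the others would be flagged as duplicates."""
    return max(duplicate_set, key=score)


# Two pairs from the same duplicate set (made-up values):
# pair_a aligns end to end but has lower base qualities;
# pair_b has higher base qualities but 20 soft-clipped bases.
duplicate_set = [
    ReadPair("pair_a", [30] * 200, 200),
    ReadPair("pair_b", [38] * 200, 180),
]

by_quality = pick_representative(duplicate_set, score_sum_of_base_qualities)
by_length = pick_representative(duplicate_set, score_total_mapped_reference_length)
print(by_quality.name, by_length.name)
```

In this made-up case the two criteria disagree: the quality sum favors `pair_b` (7600 vs. 6000) while the mapped length favors `pair_a` (200 vs. 180), which is the kind of situation the question above is about.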

Ben

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @benjaminpelissie
    Hi Ben,

    I think this article will answer your questions.

    -Sheila

  • benjaminpelissie (Madison, WI; Member)

    Thank you @Sheila.

    I have read this article a few times already but still can't find a precise answer to my questions. Nothing tells me whether using CIGAR information to mark duplicates makes MarkDuplicatesWithMateCigar a better tool than MarkDuplicates. Moreover, it is written that "As a consequence of using the mate's CIGAR string (provided by the MC tag), MarkDuplicatesWithMateCigar can only prioritize the total mapped reference length". So is this more a constraint than a feature of the tool? How does it compare (in terms of reliability in choosing representative nondups) to using the sum of base qualities of a pair? On the samtools forum (SourceForge), developers wrote that the two tools were mostly "equivalent", but gave no further information. In other threads (mostly on GitHub), we can read more about MarkDuplicatesWithMateCigar's limitations (e.g. "it cannot know the program records contained in the file that should be chained in advance"), but almost nothing about its strengths compared to MarkDuplicates.
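
As a rough illustration of the quoted constraint: a single read record carries its own CIGAR plus, in the MC tag, its mate's CIGAR, but not its mate's base qualities. A mapped-length score for the whole pair can therefore be computed from one record alone. The Python sketch below shows that calculation under the SAM specification's rule that the M, D, N, = and X operations consume reference bases. The function names are hypothetical, and the tool's exact definition of "total mapped reference length" may differ from this simple sum.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REFERENCE_CONSUMING_OPS = set("MDN=X")
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by one alignment's CIGAR string."""
    if cigar == "*":  # CIGAR unavailable, e.g. an unmapped read
        return 0
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings that the token pattern does not fully account for.
    if "".join(count + op for count, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return sum(int(count) for count, op in tokens if op in REFERENCE_CONSUMING_OPS)


def pair_reference_length(read_cigar: str, mate_cigar: str) -> int:
    """Pair-level score from a read's own CIGAR and its mate's CIGAR (MC tag value)."""
    return reference_length(read_cigar) + reference_length(mate_cigar)


print(reference_length("90M10S"))                    # soft clips do not count
print(reference_length("50M2D48M"))                  # deletions do count
print(pair_reference_length("100M", "5S80M1I14M"))   # insertions do not count
```

The three calls print 90, 100 and 194 respectively.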

    All in all, if I have the choice, which approach should I use? If there is no better tool, each having its specific pros and cons, what are those pros and cons?

    This might sound straightforward to many users, but I personally am confused. Thanks for your great help.

    Ben

  • benjaminpelissie (Madison, WI; Member)

    Thank you very much Geraldine, that really helps! I'll go with MarkDuplicates then, since (at least at first) I don't want to be too adventurous ;) Maybe adding those details to tutorial #6747 would remove this uncertainty, at least for people who (like me) would like to choose the "safe" option (?).

  • shlee (Cambridge; Member, Broadie, Moderator)
    edited May 2016

    I've added a section to the introduction of the tutorial titled "Which tool should I use, MarkDuplicates or MarkDuplicatesWithMateCigar?" that (i) highlights the considerations I mention in the body of the tutorial and (ii) points out that MarkDuplicates is what the Broad Genomics Platform uses and that MarkDuplicatesWithMateCigar is a newerish tool. @benjaminpelissie and @Geraldine_VdAuwera, let me know if this new section is unclear.

  • benjaminpelissie (Madison, WI; Member)

    I would just use the term "nondup" instead of "representative insert" (or at least define "nondup" at this point), as it is the term used in sections 3 and 4. Otherwise it is perfect, and far less ambiguous. Thank you very much @shlee.