UmiAwareMarkDuplicatesWithMateCigar seems to mark more duplicates when considering UMIs

ddorset (HudsonAlpha Institute for Biotechnology; Member)
edited September 2018 in Ask the GATK team

I recently processed some data containing UMIs. I leaned heavily on the Picard suite to accomplish this, and it's been very useful. Briefly, I followed the IDT guide for demultiplexing and analysis, which uses the ExtractIlluminaBarcodes, IlluminaBaseCallsToSam, SamToFastq, and MergeBamAlignment tools.

I successfully generated aligned BAM files that contain the UMI data in an RX field. I ran UmiAwareMarkDuplicatesWithMateCigar and, if I'm understanding correctly, it finds more duplicates when the UMIs are considered. My expectation was that it would find fewer duplicates, since reads with the same alignment location would only be considered duplicates if they also had the same associated UMI. I always thought that was one of the main justifications for using UMIs. I reviewed the code for the tool, and it does seem that the intention is to find more duplicates, not fewer. Am I misunderstanding how this tool works?
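To make the expectation above concrete, here is a minimal toy sketch in Python. It is not Picard's implementation: the read tuples, `group_reads`, and `summarize` are hypothetical names invented for illustration, and it ignores any error-tolerant UMI matching the real tool may perform. It groups reads first by position alone and then by position plus UMI, and reports both the number of sets and the number of reads that would be marked as duplicates (every read in a set except one).

```python
from collections import defaultdict

def group_reads(reads, use_umi):
    """Group read names by position, or by (position, UMI) when use_umi is True."""
    groups = defaultdict(list)
    for name, pos, umi in reads:
        key = (pos, umi) if use_umi else pos
        groups[key].append(name)
    return groups

def summarize(groups):
    """Return (number of sets, number of reads marked duplicate)."""
    n_sets = len(groups)
    # In this toy model one read per set is kept; the rest count as duplicates.
    n_dups = sum(len(members) - 1 for members in groups.values())
    return n_sets, n_dups

# Hypothetical reads: (name, alignment position, UMI).
reads = [
    ("r1", 100, "AAC"),
    ("r2", 100, "AAC"),
    ("r3", 100, "GGT"),
    ("r4", 250, "TTA"),
]

ignoring = summarize(group_reads(reads, use_umi=False))
with_umi = summarize(group_reads(reads, use_umi=True))
print("ignoring UMI:", ignoring)  # (sets, duplicates)
print("with UMI:    ", with_umi)
```

In this toy model, splitting position-based sets by UMI raises the set count while lowering the number of reads marked as duplicates, so a larger count of sets does not by itself imply more duplicates. Whether the tool's metrics follow the same logic is an assumption to check against the Picard documentation.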

Here's an example from my result set. The DUPLICATE_SETS_WITH_UMI number is consistently slightly larger than DUPLICATE_SETS_IGNORING_UMI:

    4403-ALJ-0024   9   262120  262117  625229  105498813   106735049   8.892751    8.89272 33  0.00001
    4403-ALJ-0025   9   262133  262129  2090883 293127960   301740919   8.88121 8.881199    33  0.00001
    4403-ALJ-0027   9   262119  262117  1354559 155668768   158739374   8.639415    8.639524    32  0.00001
    4403-ALJ-0028   9   262099  262091  599086  64859884    65484933    8.816336    8.816389    32  0.00001
    4403-ALJ-0029   9   262066  262027  2246668 204104680   208888854   8.898996    8.899   32  0.00001
    4403-ALJ-0030   9   262138  262133  2004581 194976450   199444272   8.871475    8.871538    32  0.00001
    4403-ALJ-0031   9   262130  262128  1560639 154524908   157833745   8.711484    8.711501    32  0.00001
    4403-ALJ-0032   9   262112  262096  1763048 187378359   191778929   8.905295    8.905289    32  0.00001
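The rows above have no header, so the following Python sketch assumes the column order of Picard's UmiMetrics output (the `FIELDS` list is that assumption, not something stated in the post). Under that assumption it parses one row and computes how many more sets are reported with UMIs than without.

```python
# Assumed column order (UmiMetrics); verify against the header of your own metrics file.
FIELDS = [
    "LIBRARY", "MEAN_UMI_LENGTH", "OBSERVED_UNIQUE_UMIS", "INFERRED_UNIQUE_UMIS",
    "OBSERVED_BASE_ERRORS", "DUPLICATE_SETS_IGNORING_UMI", "DUPLICATE_SETS_WITH_UMI",
    "OBSERVED_UMI_ENTROPY", "INFERRED_UMI_ENTROPY", "UMI_BASE_QUALITIES",
    "PCT_UMI_WITH_N",
]

def parse_row(line):
    """Split one whitespace-separated metrics row into a field-name -> string dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return dict(zip(FIELDS, values))

def set_increase(row):
    """Return (absolute, fractional) increase in sets when UMIs are considered."""
    ignoring = int(row["DUPLICATE_SETS_IGNORING_UMI"])
    with_umi = int(row["DUPLICATE_SETS_WITH_UMI"])
    extra = with_umi - ignoring
    return extra, extra / ignoring

# First row of the table above.
row = parse_row(
    "4403-ALJ-0024   9   262120  262117  625229  105498813   106735049"
    "   8.892751    8.89272 33  0.00001"
)
extra_sets, fraction = set_increase(row)
print(row["LIBRARY"], extra_sets, f"{fraction:.2%}")
```

If the assumed column order holds, the sixth and seventh columns are the two set counts being compared, and the difference for this sample is on the order of one percent.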

Best Answer


  • jishuxu (BWH; Member, Broadie)

    Hi there,

    I noticed the same thing, but my samples are more balanced: 2 out of 4 samples show increased duplication with DUPLICATE_SETS_WITH_UMI, and the other 2 show decreased duplication with DUPLICATE_SETS_WITH_UMI. I guess it might be data-specific behavior?

  • shlee (Cambridge; Member, Broadie) ✭✭✭✭✭

    Hi @ddorset and @jishuxu,

    Sheila has moved on to greener pastures and we have a new front-line support specialist who is ramping up. In the meantime, I am helping out on the forum. I have asked a developer to look into your questions on UMIAwareMarkDuplicates, as I am unfamiliar with the tool. Thanks for your patience.

  • sahuja (Member)

    Is there a way to get a list of the duplicate sets, or a list of UMIs?

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal/erroneous results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.

    For context, see this [announcement](https://software.broadinstitute.org/gatk/blog?id=24419) and check out our [support policy](https://gatkforums.broadinstitute.org/gatk/discussion/24417/what-types-of-questions-will-the-gatk-frontline-team-answer/p1?new=1).
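On the question in this thread about getting a list of UMIs: since the original post says the UMIs are stored in the RX field of the BAM, one possible approach is to tally that tag from SAM-format text. The sketch below is an illustration only, not a Picard feature. `umi_of` and `count_umis` are hypothetical helper names, the example records are made up, and it assumes the tag is written as `RX:Z:<sequence>` in the optional fields. It does not attempt to reconstruct the tool's duplicate sets.

```python
from collections import Counter

def umi_of(sam_line):
    """Return the RX tag value of one SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional TAG:TYPE:VALUE fields start at the 12th column of a SAM record.
    for tag in fields[11:]:
        if tag.startswith("RX:Z:"):
            return tag[len("RX:Z:"):]
    return None

def count_umis(sam_lines):
    """Count how many alignment records carry each UMI, skipping header lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        umi = umi_of(line)
        if umi is not None:
            counts[umi] += 1
    return counts

# Made-up SAM records for illustration.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tRX:Z:AACGT",
    "r2\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tRX:Z:AACGT",
    "r3\t0\tchr1\t250\t60\t5M\t*\t0\t0\tTTGCA\tIIIII\tRX:Z:GGTCA",
]
umi_counts = count_umis(example)
for umi, n in umi_counts.most_common():
    print(umi, n)
```

In practice the lines could come from a BAM converted to SAM text, read from standard input instead of the `example` list.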
