UmiAwareMarkDuplicatesWithMateCigar seems to mark more duplicates when considering UMIs.

ddorset, HudsonAlpha Institute for Biotechnology, Member
edited September 2018 in Ask the GATK team

I recently processed some data containing UMIs. I leaned heavily on the Picard suite to accomplish this, and it's been very useful. Briefly, I followed the IDT guide for demultiplexing and analysis. It uses the ExtractIlluminaBarcodes, IlluminaBaseCallsToSam, SamToFastq, and MergeBamAlignment tools.

I successfully generated aligned BAM files which contained the UMI data in an RX field. I ran UmiAwareMarkDuplicatesWithMateCigar and, if I'm understanding correctly, it's finding more duplicates when considering the UMIs. My expectation was that it would find fewer duplicates, as reads with the same alignment location would only be considered duplicates if they also had the same associated UMI. I always thought that was one of the main justifications for using UMIs. I reviewed the code for the tool; indeed, it seems that the intention is to find more duplicates, not fewer. Am I misunderstanding how this tool works?
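
To make that expectation concrete, here is a minimal toy sketch in Python. It is not Picard's implementation: the `(position, umi)` records and the `duplicate_stats` helper are hypothetical, UMIs are compared by exact match only, and every group (including singletons) is counted as a set. Under those assumptions, it counts duplicate sets and flagged duplicate reads with and without the UMI in the grouping key.

```python
from collections import defaultdict


def duplicate_stats(reads, use_umi):
    """Group toy reads into duplicate sets.

    reads: iterable of (position, umi) tuples (hypothetical records).
    Returns (number_of_sets, number_of_reads_flagged_as_duplicates).
    """
    set_sizes = defaultdict(int)
    for position, umi in reads:
        # Without UMIs, reads at the same position share one set.
        # With UMIs, the set is split further by exact UMI sequence.
        key = (position, umi) if use_umi else position
        set_sizes[key] += 1
    # One read per set is kept as the representative; the rest are duplicates.
    duplicates = sum(size - 1 for size in set_sizes.values())
    return len(set_sizes), duplicates


# Toy data: four reads at position 100 carrying two different UMIs,
# three reads at position 250 with one UMI, one read at position 400.
reads = [
    (100, "ACGT"), (100, "ACGT"), (100, "TTAG"), (100, "TTAG"),
    (250, "GGCA"), (250, "GGCA"), (250, "GGCA"),
    (400, "CATG"),
]

ignoring_umi = duplicate_stats(reads, use_umi=False)
with_umi = duplicate_stats(reads, use_umi=True)
print("ignoring UMI (sets, duplicates):", ignoring_umi)  # (3, 5)
print("with UMI     (sets, duplicates):", with_umi)      # (4, 4)
```

In this toy model the number of sets goes up when UMIs are considered while the number of reads flagged as duplicates goes down, so a count of sets and a count of duplicate reads can move in opposite directions. Whether that is what the tool's metrics reflect is not something this sketch can confirm.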

Here's an example from my results set. The DUPLICATE_SETS_WITH_UMI number is consistently slightly larger than DUPLICATE_SETS_IGNORING_UMI:

    4403-ALJ-0024   9   262120  262117  625229  105498813   106735049   8.892751    8.89272 33  0.00001
    4403-ALJ-0025   9   262133  262129  2090883 293127960   301740919   8.88121 8.881199    33  0.00001
    4403-ALJ-0027   9   262119  262117  1354559 155668768   158739374   8.639415    8.639524    32  0.00001
    4403-ALJ-0028   9   262099  262091  599086  64859884    65484933    8.816336    8.816389    32  0.00001
    4403-ALJ-0029   9   262066  262027  2246668 204104680   208888854   8.898996    8.899   32  0.00001
    4403-ALJ-0030   9   262138  262133  2004581 194976450   199444272   8.871475    8.871538    32  0.00001
    4403-ALJ-0031   9   262130  262128  1560639 154524908   157833745   8.711484    8.711501    32  0.00001
    4403-ALJ-0032   9   262112  262096  1763048 187378359   191778929   8.905295    8.905289    32  0.00001
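
The size of the difference can be checked with a short Python sketch. The rows above have no header, so the column positions are an assumption: the sixth and seventh fields are taken to be DUPLICATE_SETS_IGNORING_UMI and DUPLICATE_SETS_WITH_UMI, which matches the observation that the second is consistently larger. The `set_count_change` helper is hypothetical.

```python
# Rows copied from the results above (whitespace-separated, no header).
ROWS = """\
4403-ALJ-0024 9 262120 262117 625229 105498813 106735049 8.892751 8.89272 33 0.00001
4403-ALJ-0025 9 262133 262129 2090883 293127960 301740919 8.88121 8.881199 33 0.00001
4403-ALJ-0027 9 262119 262117 1354559 155668768 158739374 8.639415 8.639524 32 0.00001
4403-ALJ-0028 9 262099 262091 599086 64859884 65484933 8.816336 8.816389 32 0.00001
4403-ALJ-0029 9 262066 262027 2246668 204104680 208888854 8.898996 8.899 32 0.00001
4403-ALJ-0030 9 262138 262133 2004581 194976450 199444272 8.871475 8.871538 32 0.00001
4403-ALJ-0031 9 262130 262128 1560639 154524908 157833745 8.711484 8.711501 32 0.00001
4403-ALJ-0032 9 262112 262096 1763048 187378359 191778929 8.905295 8.905289 32 0.00001
"""


def set_count_change(text, ignoring_col=5, with_col=6):
    """Return {sample: (absolute_change, percent_change)} in duplicate sets.

    ignoring_col / with_col are zero-based field indices and are an
    ASSUMPTION about the column order, since the rows carry no header.
    """
    changes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        ignoring = int(fields[ignoring_col])
        with_umi = int(fields[with_col])
        delta = with_umi - ignoring
        changes[fields[0]] = (delta, 100.0 * delta / ignoring)
    return changes


changes = set_count_change(ROWS)
for sample, (delta, pct) in changes.items():
    print(f"{sample}\t{delta:+d}\t{pct:+.2f}%")
```

Under that column assumption, every sample shows a positive change in the number of duplicate sets when UMIs are taken into account.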

Best Answer


  • jishuxu, BWH, Member, Broadie

    Hi there,

    This is also what I noticed, but my samples are more balanced: 2 out of 4 samples show increased duplication with DUPLICATE_SETS_WITH_UMI, and the other 2 show decreased duplication. I guess it might be data-specific behavior?

  • shlee, Cambridge, Member, Broadie ✭✭✭✭✭

    Hi @ddorset and @jishuxu,

    Sheila has moved on to greener pastures, and we have a new front-line support specialist who is ramping up. In the meantime, I am helping out on the forum. I have asked a developer to look into your questions on UmiAwareMarkDuplicatesWithMateCigar, as I am unfamiliar with the tool. Thanks for your patience.
