Get Picard duplication metrics without running MarkDuplicates?

Is it possible to retrieve Picard duplication metrics without running MarkDuplicates? I would like to get these metrics for a merged bam file where the original bams already have dups flagged, and did not go through the same PCR.

Best Answer


  • as_ubc (Member)

    Thanks for the quick reply. Just ran EstimateLibraryComplexity on a bam that already went through MarkDuplicates, and got different results from the original duplication metrics. Should I be using non-default parameters?

  • Kurt ✭✭✭ (Member)

    I'm just a user and I just used the defaults. Personally, I wasn't expecting to get the exact same results, but I felt the output was close enough to what it would have been had I run MarkDuplicates as part of a pipeline run, and the difference would not have changed my opinion on the data (I think in my experience the difference was less than 1%).

  • as_ubc (Member)

    Sorry, but that's not satisfactory for me. If you run the supposedly same program on the same data twice and get different results, I think there is reason to be concerned, unless the program is randomly sub-sampling the data to derive these metrics. Is this the case? If so, is there a way for me to fix the seed? I don't see anything in the documentation.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)


    I'm not sure I understand what exactly you are trying to do. Are you trying to find out how many duplicates are in your bam file?


  • pdexheimer ✭✭✭✭ (Member)

    But they're not the same program. They use radically different methods - the EstimateLibraryComplexity description starts with

    Attempts to estimate library complexity from sequence of read pairs alone. Does so by sorting all reads by the first N bases (5 by default) of each read and then comparing reads with the first N bases identical to each other for duplicates

    Which is very different than MarkDuplicates' method of using the alignments themselves to determine where the duplicates are.
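
    As a toy illustration of that contrast (a sketch, not Picard's actual code): the snippet below groups the same read pairs two ways, once by the first N bases of each mate and once by alignment coordinates. The tuple layout, the function names, and the example reads are all made up for illustration. Note that the prefix grouping is only the first step described in the quote; the real tool then compares the reads within each group to decide which are duplicates.

```python
from collections import defaultdict

# Hypothetical record layout: (name, mate1_seq, mate2_seq, mate1_aln, mate2_aln),
# where each alignment is a (contig, position, strand) tuple.
reads = [
    ("a", "ACGTACGT", "TTGACCAA", ("chr1", 100, "+"), ("chr1", 300, "-")),
    # Same alignment as "a", with a sequencing difference late in each mate.
    ("b", "ACGTACGA", "TTGACCAT", ("chr1", 100, "+"), ("chr1", 300, "-")),
    # Shares the 5-base prefixes with "a" and "b" but aligns elsewhere.
    ("c", "ACGTAGGG", "TTGACTTT", ("chr2", 500, "+"), ("chr2", 720, "-")),
]


def group_by_sequence_prefix(read_pairs, n=5):
    """Sequence-only grouping: key on the first n bases of each mate."""
    groups = defaultdict(list)
    for name, seq1, seq2, _aln1, _aln2 in read_pairs:
        groups[(seq1[:n], seq2[:n])].append(name)
    return dict(groups)


def group_by_alignment(read_pairs):
    """Alignment-based grouping: key on where both mates map."""
    groups = defaultdict(list)
    for name, _seq1, _seq2, aln1, aln2 in read_pairs:
        groups[(aln1, aln2)].append(name)
    return dict(groups)


candidates_by_prefix = group_by_sequence_prefix(reads)
duplicates_by_alignment = group_by_alignment(reads)
```

    Here the prefix key puts all three pairs into one candidate group, while the alignment key separates "c" from "a" and "b", so the two approaches start from different candidate sets and need not agree.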

    Most of the metrics in the MarkDuplicates metrics report are simply counting based, particularly if you already have duplicates marked. The only one that's more complex is the estimate of the library size, which is just an estimate, so it's unreasonable to expect exactly the same answer from several different methods. There's a thread on the samtools mailing list talking about the basis for those estimates if you really want to get down and dirty with it...
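
    If the duplicate flag (0x400) is already set, the counting part can be sketched in a few lines of Python. This is a rough approximation of how I understand the report to be tallied, not Picard's code: the function names are hypothetical, it reads plain SAM text (for example the output of `samtools view -h`), it lumps all reads into one library rather than reporting per library, and it ignores optical duplicates, which cannot be recovered from the flag alone. The library-size function solves the Lander-Waterman style equation C/X = 1 - exp(-N/X) by bisection, where N is read pairs examined and C is unique read pairs; treat it as an assumption about the method rather than a reference implementation.

```python
import math

# SAM flag bits (from the SAM specification).
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
SECONDARY, DUPLICATE, SUPPLEMENTARY = 0x100, 0x400, 0x800


def duplication_counts(sam_lines):
    """Tally duplicate-flag counts from SAM text lines (header lines are skipped)."""
    counts = {
        "UNPAIRED_READS_EXAMINED": 0,
        "READ_PAIRS_EXAMINED": 0,
        "UNPAIRED_READ_DUPLICATES": 0,
        "READ_PAIR_DUPLICATES": 0,
    }
    paired_reads = paired_dups = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        # Only primary, mapped records enter the examined/duplicate counts here.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        is_dup = int(bool(flag & DUPLICATE))
        if not (flag & PAIRED) or (flag & MATE_UNMAPPED):
            counts["UNPAIRED_READS_EXAMINED"] += 1
            counts["UNPAIRED_READ_DUPLICATES"] += is_dup
        else:
            paired_reads += 1
            paired_dups += is_dup
    # Each pair contributes two records, so halve the per-read tallies.
    counts["READ_PAIRS_EXAMINED"] = paired_reads // 2
    counts["READ_PAIR_DUPLICATES"] = paired_dups // 2
    examined = counts["UNPAIRED_READS_EXAMINED"] + 2 * counts["READ_PAIRS_EXAMINED"]
    dups = counts["UNPAIRED_READ_DUPLICATES"] + 2 * counts["READ_PAIR_DUPLICATES"]
    counts["PERCENT_DUPLICATION"] = dups / examined if examined else 0.0
    return counts


def estimate_library_size(read_pairs, unique_read_pairs):
    """Solve C/X = 1 - exp(-N/X) for X by bisection; None if there are no duplicates."""
    if read_pairs <= 0 or read_pairs - unique_read_pairs <= 0 or unique_read_pairs <= 0:
        return None

    def f(x):
        return unique_read_pairs / x - 1 + math.exp(-read_pairs / x)

    # Search for X as a multiple of the unique pair count, bracketed by [lo, hi].
    lo, hi = 1.0, 100.0
    while f(hi * unique_read_pairs) > 0:
        hi *= 10.0
    for _ in range(40):
        mid = (lo + hi) / 2.0
        value = f(mid * unique_read_pairs)
        if value == 0:
            break
        if value > 0:
            lo = mid
        else:
            hi = mid
    return int(unique_read_pairs * (lo + hi) / 2.0)
```

    To use it, pass the lines of a SAM stream to `duplication_counts`, then call `estimate_library_size(c["READ_PAIRS_EXAMINED"], c["READ_PAIRS_EXAMINED"] - c["READ_PAIR_DUPLICATES"])` on the result. Expect the counts to line up closely with an existing metrics file and the library size to differ somewhat, since it is an estimate.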

  • as_ubc (Member)

    Hi Sheila and pdexheimer,

    Thanks very much for your replies.

    Sheila, what I was wondering is whether it's possible to get the set of metrics found in the "duplication metrics" file produced by MarkDuplicates, but without having to run MarkDuplicates. This includes "UNPAIRED_READ_DUPLICATES", "READ_PAIR_DUPLICATES", "PERCENT_DUPLICATION", and "ESTIMATED_LIBRARY_SIZE".

    pdexheimer, thank you, that makes sense. Is there a different way to retrieve the metrics produced by MarkDuplicates? I have a pipeline that parses this output for plotting, so would prefer to run the same program, rather than retrieve the data in some other way, if possible.

    Thanks very much for your time.

  • as_ubc (Member)

    Alright. Thanks for your response.
