
Get Picard duplication metrics without running MarkDuplicates?

Is it possible to retrieve Picard duplication metrics without running MarkDuplicates? I would like to get these metrics for a merged bam file where the original bams already have dups flagged, and did not go through the same PCR.

Best Answer


  • as_ubc Member

    Thanks for the quick reply. Just ran EstimateLibraryComplexity on a bam that already went through MarkDuplicates, and got different results from the original duplication metrics. Should I be using non-default parameters?

  • Kurt Member ✭✭✭

    I'm just a user, and I just used the defaults. Personally, I wasn't expecting to get exactly the same results, but the output was close enough to what I would have gotten from running MarkDuplicates in a pipeline run that the difference would not have changed my opinion of the data (in my experience, the difference was less than 1%).

  • as_ubc Member

    Sorry, but that's not satisfactory for me. If you run the supposedly same program on the same data twice and get different results, I think there is reason to be concerned, unless the program is randomly sub-sampling the data to derive these metrics. Is this the case? If so, is there a way for me to fix the seed? I don't see anything in the documentation.

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭


    I'm not sure I understand what exactly you are trying to do. Are you trying to find out how many duplicates are in your bam file?


  • pdexheimer Member ✭✭✭✭

    But they're not the same program. They use radically different methods - the EstimateLibraryComplexity description starts with

    Attempts to estimate library complexity from sequence of read pairs alone. Does so by sorting all reads by the first N bases (5 by default) of each read and then comparing reads with the first N bases identical to each other for duplicates

    That is very different from MarkDuplicates' method of using the alignments themselves to determine where the duplicates are.

    Most of the metrics in the MarkDuplicates metrics report are simply counting based, particularly if you already have duplicates marked. The only one that's more complex is the estimate of the library size. That one is just an estimate, so it's unreasonable to expect exactly the same answer from several different methods. There's a thread on the samtools mailing list talking about the basis for those estimates if you really want to get down and dirty with it...
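To illustrate the "counting based" point, here is a minimal Python sketch that tallies duplication counts from the SAM FLAG values of a file whose duplicates are already marked. It is not Picard's code: the function names are made up for illustration, the flag bits come from the SAM specification, and the dictionary keys only borrow the column names from the MarkDuplicates metrics file. It assumes both mates of every mapped pair are present, pools all libraries together, and ignores optical duplicates, so the numbers may not match a real MarkDuplicates report. The library size function follows my understanding of the Lander-Waterman style estimate (solve C/X = 1 - exp(-N/X) for X by bisection), which is an assumption about the method rather than a quote of it.

```python
import math
from collections import Counter

# SAM flag bits (from the SAM specification).
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
SECONDARY, DUPLICATE, SUPPLEMENTARY = 0x100, 0x400, 0x800


def count_duplication_metrics(flags):
    """Tally duplication counts from an iterable of SAM FLAG integers."""
    c = Counter()
    for flag in flags:
        if flag & (SECONDARY | SUPPLEMENTARY):
            c["secondary_or_supplementary"] += 1
        elif flag & UNMAPPED:
            c["unmapped"] += 1
        elif not (flag & PAIRED) or (flag & MATE_UNMAPPED):
            # Mapped read with no mapped mate: counted as a single read.
            c["unpaired"] += 1
            c["unpaired_dups"] += int(bool(flag & DUPLICATE))
        else:
            # Mapped read whose mate is also mapped: counted per read here,
            # halved below to get pairs.
            c["paired"] += 1
            c["paired_dups"] += int(bool(flag & DUPLICATE))

    examined = c["unpaired"] + c["paired"]
    duplicates = c["unpaired_dups"] + c["paired_dups"]
    return {
        "UNPAIRED_READS_EXAMINED": c["unpaired"],
        "READ_PAIRS_EXAMINED": c["paired"] // 2,
        "SECONDARY_OR_SUPPLEMENTARY_RDS": c["secondary_or_supplementary"],
        "UNMAPPED_READS": c["unmapped"],
        "UNPAIRED_READ_DUPLICATES": c["unpaired_dups"],
        "READ_PAIR_DUPLICATES": c["paired_dups"] // 2,
        "PERCENT_DUPLICATION": duplicates / examined if examined else 0.0,
    }


def estimate_library_size(read_pairs, unique_read_pairs):
    """Solve C/X = 1 - exp(-N/X) for X by bisection (N pairs, C unique pairs).

    Returns None when there are no duplicates, because the estimate is
    undefined in that case.
    """
    if read_pairs <= 0 or unique_read_pairs <= 0:
        return None
    if read_pairs - unique_read_pairs <= 0:
        return None

    def f(x):
        return unique_read_pairs / x - 1 + math.exp(-read_pairs / x)

    lo, hi = 1.0, 100.0  # bounds, expressed as multiples of unique_read_pairs
    while f(hi * unique_read_pairs) > 0:
        hi *= 10.0
    for _ in range(40):
        mid = (lo + hi) / 2.0
        u = f(mid * unique_read_pairs)
        if u == 0:
            break
        if u > 0:
            lo = mid
        else:
            hi = mid
    return int(unique_read_pairs * (lo + hi) / 2.0)
```

The FLAG values could come from column 2 of `samtools view` output, for example. In this sketch the library size would be estimated from `READ_PAIRS_EXAMINED` and `READ_PAIRS_EXAMINED - READ_PAIR_DUPLICATES`. For a merged file you would want to do the tally separately per library, since duplicates from different libraries are not comparable.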

  • as_ubc Member

    Hi Sheila and pdexheimer,

    Thanks very much for your replies.

    Sheila, what I was wondering is whether it's possible to get the set of metrics found in the "duplication metrics" file produced by MarkDuplicates, but without having to run MarkDuplicates. This includes "UNPAIRED_READ_DUPLICATES", "READ_PAIR_DUPLICATES", "PERCENT_DUPLICATION", and "ESTIMATED_LIBRARY_SIZE".

    pdexheimer, thank you, that makes sense. Is there a different way to retrieve the metrics produced by MarkDuplicates? I have a pipeline that parses this output for plotting, so would prefer to run the same program, rather than retrieve the data in some other way, if possible.

    Thanks very much for your time.

  • as_ubc Member

    Alright. Thanks for your response.
