Get Picard duplication metrics without running MarkDuplicates?

Is it possible to retrieve Picard duplication metrics without running MarkDuplicates? I would like to get these metrics for a merged BAM file where the original BAMs already have duplicates flagged and did not go through the same PCR.
Best Answer
pdexheimer ✭✭✭✭
No, if you want it to be in the same format you're probably just better off re-running MarkDuplicates. You would have to construct queries (I'd probably use samtools view) to get the read counts and duplicate counts, then you could use those values to calculate the percent duplication. I haven't made it all the way through that thread, so I don't know how to calculate the estimated library size. Far simpler to just rerun the original program, I think.
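As a rough illustration of that counting approach, here's a minimal sketch, assuming samtools is on the PATH and duplicates are already flagged in the merged BAM. The flag arithmetic follows the standard SAM flag definitions (0x1 paired, 0x4 unmapped, 0x100 secondary, 0x400 duplicate, 0x800 supplementary) and only approximates Picard's PERCENT_DUPLICATION accounting, so treat the numbers as ballpark figures.

```python
# Sketch: pull read and duplicate counts out of an already-flagged BAM with
# `samtools view -c`, then compute an approximate percent-duplication figure.
import subprocess

def count_reads(bam, *filters):
    """Run `samtools view -c` with the given -f/-F filters and return the count."""
    cmd = ["samtools", "view", "-c", *filters, bam]
    return int(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)

bam = "merged.bam"  # placeholder input

paired_examined   = count_reads(bam, "-f", "0x1", "-F", "0x904")    # paired, mapped, primary
unpaired_examined = count_reads(bam, "-F", "0x905")                 # unpaired, mapped, primary
paired_dups       = count_reads(bam, "-f", "0x401", "-F", "0x904")  # paired duplicates
unpaired_dups     = count_reads(bam, "-f", "0x400", "-F", "0x905")  # unpaired duplicates

read_pairs_examined  = paired_examined // 2
read_pair_duplicates = paired_dups // 2

# Duplicated reads over examined reads, counting a pair as two reads.
percent_duplication = (unpaired_dups + 2 * read_pair_duplicates) / (
    unpaired_examined + 2 * read_pairs_examined
)
print(f"PERCENT_DUPLICATION (approx): {percent_duplication:.4f}")
```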
Answers
https://broadinstitute.github.io/picard/command-line-overview.html#EstimateLibraryComplexity
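A minimal sketch of invoking that tool, assuming a local picard.jar and the classic I=/O= argument style; the file names are placeholders:

```python
# Hypothetical invocation of Picard's EstimateLibraryComplexity; swap in your own
# picard.jar path, input BAM, and output metrics file.
import subprocess

subprocess.run(
    [
        "java", "-jar", "picard.jar", "EstimateLibraryComplexity",
        "I=merged.bam",                             # placeholder input BAM
        "O=merged.library_complexity_metrics.txt",  # metrics file written by the tool
    ],
    check=True,
)
```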
Thanks for the quick reply. Just ran EstimateLibraryComplexity on a bam that already went through MarkDuplicates, and got different results from the original duplication metrics. Should I be using non-default parameters?
I'm just a user and I just used the defaults. Personally, I wasn't expecting to get the exact same results, but I felt the output was close enough to what it would have been had I run MarkDuplicates in a pipeline run, and the difference would not have changed my opinion of the data (in my experience the difference was less than 1%).
Sorry, but that's not satisfactory for me. If you run supposedly the same program on the same data twice and get different results, I think there is reason to be concerned, unless the program is randomly sub-sampling the data to derive these metrics. Is this the case? If so, is there a way for me to fix the seed? I don't see anything in the documentation.
@as_ubc
Hi,
I'm not sure I understand what exactly you are trying to do. Are you trying to find out how many duplicates are in your bam file?
Thanks,
Sheila
But they're not the same program. They use radically different methods: the EstimateLibraryComplexity description starts by explaining that it estimates duplication from the read sequences alone, which is very different from MarkDuplicates' method of using the alignments themselves to determine where the duplicates are.
Most of the metrics in the MarkDuplicates metrics report are simply count-based, particularly if you already have duplicates marked. The only one that's more involved is the estimate of the library size, which is just an estimate, so it's unreasonable to expect exactly the same answer from several different methods. There's a thread on the samtools mailing list talking about the basis for those estimates if you really want to get down and dirty with it...
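For what it's worth, the library-size estimate is commonly derived from the relation C/X = 1 - exp(-N/X), where N is the number of read pairs examined, C is the number of distinct (non-duplicate) pairs, and X is the library size being solved for. A rough sketch that solves it by bisection, as an illustration of the idea rather than a drop-in for Picard's code:

```python
# Sketch of the library-size estimate: solve C/X = 1 - exp(-N/X) for X by bisection,
# where N = read pairs examined and C = distinct (non-duplicate) read pairs.
import math

def estimate_library_size(read_pairs, unique_read_pairs):
    n, c = float(read_pairs), float(unique_read_pairs)
    if n <= 0 or c <= 0 or c >= n:
        return None  # need at least one duplicate pair to base an estimate on

    # f(x) > 0 while x is too small a library size, < 0 once it is too large.
    def f(x):
        return c / x - 1.0 + math.exp(-n / x)

    lo, hi = c, 100.0 * c        # bracket the root
    while f(hi) > 0:
        hi *= 10.0
    for _ in range(60):          # bisection converges to plenty of precision
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return int((lo + hi) / 2.0)

# Example with made-up counts: 1M pairs examined, 900k of them distinct.
print(estimate_library_size(1_000_000, 900_000))
```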
Hi Sheila and pdexheimer,
Thanks very much for your replies.
Sheila, what I was wondering is whether it's possible to get the set of metrics found in the "duplication metrics" file produced by MarkDuplicates, but without having to run MarkDuplicates. This includes "UNPAIRED_READ_DUPLICATES", "READ_PAIR_DUPLICATES", "PERCENT_DUPLICATION", and "ESTIMATED_LIBRARY_SIZE".
pdexheimer, thank you, that makes sense. Is there a different way to retrieve the metrics produced by MarkDuplicates? I have a pipeline that parses this output for plotting, so would prefer to run the same program, rather than retrieve the data in some other way, if possible.
Thanks very much for your time.
No, if you want it to be in the same format you're probably just better off re-running MarkDuplicates. You would have to construct queries (I'd probably use samtools view) to get the read counts and duplicate counts, then you could use those values to calculate the percent duplication. I haven't made it all the way through that thread, so I don't know how to calculate the estimated library size. Far simpler to just rerun the original program, I think.
Alright. Thanks for your response.