
problems with EstimateLibraryComplexity and MAX_GROUP_RATIO

Hello everybody,

I have been using your tool EstimateLibraryComplexity for a long time to estimate the percentage of duplicated reads in sequencing experiments on highly duplicated long mate-pair libraries. However, I noticed a large discrepancy in the duplicate estimates compared to other tools, such as bbcountunique.sh (part of the BBMap suite), when the number of duplicates becomes very high.

For example, I used EstimateLibraryComplexity to examine a long mate-pair library of 6M reads, for which bbcountunique.sh estimates 22.96% duplicates (without distinguishing between optical and PCR duplicates). EstimateLibraryComplexity reports a PERCENT_DUPLICATION of 2.77%, which is an order of magnitude lower.

Looking at the other metrics, READ_PAIRS_EXAMINED was only 785,328, with READ_PAIR_DUPLICATES = 21,780 and READ_PAIR_OPTICAL_DUPLICATES = 9,593. More surprisingly, the program returned a long list of warnings:

WARNING 2015-10-16 18 43 50 EstimateLibraryComplexity Omitting group with over 500 times the expected mean number of read pairs. Mean 1, Actual .... Prefixes ....
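As a sanity check, the reported PERCENT_DUPLICATION is consistent with the two counts above. This minimal sketch assumes the metric is simply READ_PAIR_DUPLICATES divided by READ_PAIRS_EXAMINED when there are no unpaired reads. That formula is my assumption, not something taken from the documentation, and the variable names are only illustrative:

```python
# Counts copied from the EstimateLibraryComplexity metrics quoted above.
read_pairs_examined = 785_328
read_pair_duplicates = 21_780

# Assumed formula: duplicated pairs over examined pairs, as a percentage.
percent_duplication = 100 * read_pair_duplicates / read_pairs_examined

print(f"{percent_duplication:.2f}%")  # prints 2.77%
```

So the low value follows directly from the small number of read pairs that were actually examined.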

The highest "Actual" value was 1,018,992. I saw that the parameter MAX_GROUP_RATIO=500 (default) controls this behaviour. According to the documentation, it prevents the program from processing "self-similar groups that are this many times over the mean expected group size", so in my case a total of 4,722,532 reads were excluded from the calculation of the percentage of duplicates.
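To make my understanding of the cutoff concrete, here is a rough sketch of how I read the warning message: a group is omitted when its size exceeds MAX_GROUP_RATIO times the mean group size. This is not the tool's actual code. The function `split_groups` and all group sizes except 1,018,992 are hypothetical, chosen only for illustration:

```python
def split_groups(group_sizes, mean_group_size, max_group_ratio=500):
    """Split group sizes into (kept, omitted) using the size cutoff."""
    cutoff = mean_group_size * max_group_ratio
    kept = [n for n in group_sizes if n <= cutoff]
    omitted = [n for n in group_sizes if n > cutoff]
    return kept, omitted


# Hypothetical group sizes; only 1_018_992 comes from the warnings above.
sizes = [1, 2, 3, 400, 1_018_992]

# The warning reports "Mean 1", so the cutoff here is 1 * 500 = 500.
kept, omitted = split_groups(sizes, mean_group_size=1)

print("kept:", kept)
print("omitted:", omitted)
```

Under this reading, every read pair in an omitted group is left out of READ_PAIRS_EXAMINED, so the largest groups of duplicates never contribute to the estimate.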

My questions are:

What is the rationale for excluding groups with such a high number of duplicates?

I tried setting MAX_GROUP_RATIO=10000000 to include those highly duplicated reads in the calculation of the duplication percentage, but the program stalled.

What can I do to obtain the real percentage of duplicates in my sequencing files with EstimateLibraryComplexity?

Thank you very much for your explanations!

