Problems with EstimateLibraryComplexity and MAX_GROUP_RATIO
I have been using your tool EstimateLibraryComplexity for a long time to estimate the percentage of duplicated reads in sequencing experiments on highly duplicated long mate-pair libraries. However, I have noticed a large discrepancy in the duplicate estimates compared with other tools, such as bbcountunique.sh from the BBMap suite, when the number of duplicates becomes very high.
For example, I ran EstimateLibraryComplexity on a long mate-pair library of 6M reads for which bbcountunique.sh estimates 22.96% duplicates (without distinguishing between optical and PCR duplicates). EstimateLibraryComplexity reports a PERCENT_DUPLICATION of 2.77%, which is an order of magnitude lower.
Looking at the other metrics, READ_PAIRS_EXAMINED was only 785,328, READ_PAIR_DUPLICATES = 21,780 and READ_PAIR_OPTICAL_DUPLICATES = 9,593. More surprisingly, the program returned a long list of "WARNINGS":
WARNING 2015-10-16 18 43 50 EstimateLibraryComplexity Omitting group with over 500 times the expected mean number of read pairs. Mean 1, Actual .... Prefixes ....
The highest "Actual" value was 1,018,992. I saw that the parameter MAX_GROUP_RATIO=500 (default) controls this behaviour. According to the documentation, with this setting the program does not "process self-similar groups that are this many times over the mean expected group size", so in my case a total of 4,722,532 reads were excluded from the calculation of the percentage of duplicates.
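To make the numbers above easier to check, here is a minimal Python sketch of the arithmetic as I understand it. It rests on assumptions of mine rather than on the tool's source code: that PERCENT_DUPLICATION is simply READ_PAIR_DUPLICATES divided by READ_PAIRS_EXAMINED when all reads are paired, that a group is omitted once its size exceeds MAX_GROUP_RATIO times the mean, and that the 4,722,532 figure counts individual reads rather than read pairs.

```python
# Figures copied from the EstimateLibraryComplexity output quoted above.
read_pairs_examined = 785_328
read_pair_duplicates = 21_780
max_group_ratio = 500        # default MAX_GROUP_RATIO
mean_group_size = 1          # "Mean 1" in the warning
largest_group = 1_018_992    # highest "Actual" value in the warnings
reads_excluded = 4_722_532   # assumption: counted as reads, not read pairs

# Assumption: with paired reads only, the reported percentage is
# duplicates / pairs examined.
percent_duplication = 100 * read_pair_duplicates / read_pairs_examined

# Assumption: a group is omitted when it is larger than ratio * mean.
omit_threshold = max_group_ratio * mean_group_size
largest_group_omitted = largest_group > omit_threshold

# Compare the reads that were examined (two per pair) with those excluded.
reads_examined = 2 * read_pairs_examined
reads_total = reads_examined + reads_excluded
fraction_excluded = reads_excluded / reads_total

print(f"PERCENT_DUPLICATION from counts: {percent_duplication:.2f}%")
print(f"Omission threshold (group size): {omit_threshold}")
print(f"Largest group omitted: {largest_group_omitted}")
print(f"Reads examined + excluded: {reads_total:,}")
print(f"Fraction of reads excluded: {fraction_excluded:.1%}")
```

If these assumptions hold, the reported 2.77% is reproduced from the two counts alone, and the examined plus excluded reads add up to roughly the 6M reads in the library, which suggests that most of the library never entered the calculation.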
My questions are:
What is the rationale for excluding groups with such a high number of duplicates?
I tried setting MAX_GROUP_RATIO=10000000 to include those highly duplicated reads in the calculation of the percentage of duplication, but the program stalled.
What can I do to obtain the real percentage of duplicates in my sequencing files with EstimateLibraryComplexity?
Thank you very much for your explanations!