Problems with EstimateLibraryComplexity and MAX_GROUP_RATIO
I have been using your tool EstimateLibraryComplexity for a long time to estimate the percentage of duplicated reads in sequencing experiments with highly duplicated long mate-pair libraries. However, I noticed a large discrepancy in the duplicate estimate, compared to other tools such as bbcountunique.sh (part of the BBMap suite), when the number of duplicates becomes very high.
For example, I examined a 6M-read long mate-pair library with EstimateLibraryComplexity. For the same library, bbcountunique.sh estimates 22.96% duplicates (without distinguishing between optical and PCR duplicates), while EstimateLibraryComplexity reports a PERCENT_DUPLICATION of 2.77%, an order of magnitude lower.
Looking at the other metrics, READ_PAIRS_EXAMINED was only 785,328, READ_PAIR_DUPLICATES was 21,780, and READ_PAIR_OPTICAL_DUPLICATES was 9,593. More surprisingly, the program returned a long list of warnings:
WARNING 2015-10-16 18 43 50 EstimateLibraryComplexity Omitting group with over 500 times the expected mean number of read pairs. Mean 1, Actual .... Prefixes ....
The highest "Actual" value was 1,018,992. I saw that there is a parameter, MAX_GROUP_RATIO=500 (the default), that controls this behaviour. As reported in the documentation, it avoids processing "self-similar groups that are this many times over the mean expected group size"; so in my case a total of 4,722,532 reads were excluded from the calculation of the duplicate percentage.
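To illustrate my reading of the numbers, here is a quick sanity check. It assumes (my assumption, not something the documentation states outright) that the reported PERCENT_DUPLICATION is computed only over the pairs that survive the MAX_GROUP_RATIO filter, i.e. READ_PAIR_DUPLICATES / READ_PAIRS_EXAMINED:

```python
# Counts taken from the EstimateLibraryComplexity metrics file above.
read_pairs_examined = 785_328
read_pair_duplicates = 21_780
excluded_reads = 4_722_532  # reads dropped by the MAX_GROUP_RATIO filter

# Assumed formula: duplication rate over examined pairs only.
pct_dup_reported = read_pair_duplicates / read_pairs_examined
print(f"{pct_dup_reported:.2%}")  # prints 2.77%, matching PERCENT_DUPLICATION
```

If this reading is right, the 2.77% figure simply cannot reflect the excluded 4.7M reads, which would explain most of the gap with bbcountunique.sh.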
My questions are:
What is the rationale for excluding groups with such a high number of duplicates from the calculation?
I tried setting MAX_GROUP_RATIO=10000000 to include those highly duplicated reads in the duplication-percentage calculation, but the program stalled.
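For reference, the kind of invocation I mean is roughly the following (the jar path and file names are placeholders for my actual setup):

```shell
# Sketch of the attempted run; adjust -Xmx, jar path, and file names as needed.
java -Xmx8g -jar picard.jar EstimateLibraryComplexity \
    INPUT=matepair.bam \
    OUTPUT=library_complexity_metrics.txt \
    MAX_GROUP_RATIO=10000000
```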
What can I do to obtain the real percentage of duplicates in my sequencing files with EstimateLibraryComplexity?
Thank you very much for your explanations!