How to interpret output of EstimateLibraryComplexity
Dear all,
I am quite new to the analysis of RNA-seq data and was searching the web for an answer on how to interpret the results output by EstimateLibraryComplexity. The only discussion I found was this one (https://www.biostars.org/p/103503/), but it didn't sufficiently help me with my problem.
I am interested in the distribution of the duplicates in my RNA-seq data set, which, if I understood correctly, is shown in the histogram of EstimateLibraryComplexity.
This is my output:
METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY                         Unknown
UNPAIRED_READS_EXAMINED         0
READ_PAIRS_EXAMINED             9285609
SECONDARY_OR_SUPPLEMENTARY_RDS  0
UNMAPPED_READS                  0
UNPAIRED_READ_DUPLICATES        0
READ_PAIR_DUPLICATES            1376414
READ_PAIR_OPTICAL_DUPLICATES    156167
PERCENT_DUPLICATION             0.148231
ESTIMATED_LIBRARY_SIZE          31035324
HISTOGRAM java.lang.Integer
duplication_group_count Unknown
1 6783948
2 951889
3 133263
4 25143
5 7450
6 3095
7 1569
8 915
9 532
10 380
11 260
12 164
13 141
14 91
15 79
16 37
...
What does this histogram mean? Does the "1" in the first row, first column mean that there are 6,783,948 reads that have 1 duplicate, or are these 6,783,948 unique reads? If not, how do I calculate the number of reads that do not have any duplicates?
Is it also possible to calculate the total number of reads examined, or the total number of duplicates within this sample? If so, how?
Is it also possible to obtain the number of duplicates at each genomic position to which a read maps? Do I get this information from the histogram as well?
I hope it is clear what I am asking. I would be very happy if some of you could bring some light into the darkness.
Greetings Jutta
Best Answer

dekling Broad Institute admin
@Jutta: I apologize if my explanation was not clear.
The first part of your question:
You have 133,263 duplication groups, each consisting of 3 identical read pairs. Likewise, you have 3,095 groups of 6 identical read pairs and 79 groups of 15 identical read pairs.
Second part of your question:
In the first row, the 6,783,948 read pairs in groups of size 1 were each observed exactly once; in other words, these 6,783,948 read pairs have no duplicates at all.
I hope this helps. Let us know if it is still not clear.
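To make the arithmetic concrete, here is a quick sanity check in Python (my own sketch, not Picard output): reading the first histogram column as the duplication-group size and the second as the number of groups of that size, the histogram reproduces the totals in the metrics line.

```python
# Histogram rows as posted, truncated at group size 16; the "..." rows are
# omitted, so the totals fall slightly short of the metrics line.
histogram = {
    1: 6783948, 2: 951889, 3: 133263, 4: 25143, 5: 7450, 6: 3095,
    7: 1569, 8: 915, 9: 532, 10: 380, 11: 260, 12: 164,
    13: 141, 14: 91, 15: 79, 16: 37,
}

# Each group of size k contributes k read pairs in total...
total_pairs = sum(k * n for k, n in histogram.items())  # ~READ_PAIRS_EXAMINED
distinct_pairs = sum(histogram.values())                # number of duplication groups
# ...and k - 1 of those pairs are duplicates.
duplicate_pairs = total_pairs - distinct_pairs          # ~READ_PAIR_DUPLICATES

print(total_pairs, distinct_pairs, duplicate_pairs)
```

With the rows shown this gives 9,280,510 total read pairs and 1,371,554 duplicates; the small gap to the reported 9,285,609 and 1,376,414 sits in the truncated "..." rows.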
Answers
@Jutta: The first column contains the duplication group size (the number of times an identical read pair was observed), while the second column contains the number of distinct read pairs from your BAM file observed that many times.
For example, 133,263 distinct read pairs each appear 3 times... and 380 distinct read pairs each appear 10 times.
Unique read pairs (pairs without any duplicate) are given by the first histogram row. Subtracting READ_PAIR_DUPLICATES from READ_PAIRS_EXAMINED gives the number of distinct read pairs, with each duplication group counted once; note that READ_PAIR_OPTICAL_DUPLICATES is already included in READ_PAIR_DUPLICATES and should not be subtracted a second time.
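As a small numeric sketch of this calculation (assuming, per Picard's DuplicationMetrics documentation, that READ_PAIR_OPTICAL_DUPLICATES is already counted inside READ_PAIR_DUPLICATES):

```python
metrics = {
    "READ_PAIRS_EXAMINED": 9285609,
    "READ_PAIR_DUPLICATES": 1376414,  # already includes the 156167 optical duplicates
}

# Distinct read pairs: each duplication group counted once.
distinct_pairs = metrics["READ_PAIRS_EXAMINED"] - metrics["READ_PAIR_DUPLICATES"]
print(distinct_pairs)  # 7909195

# Read pairs with no duplicate at all are simply the first histogram row
# (group size 1): 6,783,948 in this data set.
```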
The output file (METRICS CLASS picard.sam.DuplicationMetrics) has most of the information you are requesting. You might try the MarkDuplicates tool to get the rest. Let us know if this is still unclear.
Dear Dekling,
Thanks for your answer. I am not sure I understand it correctly yet. When you say that 133,263 read pairs are duplicated 3 times, does that mean there are 3 × 133,263 identical read pairs in my data set, or 133,263 distinct read pairs, each of which occurs 3 times?
And what about the first row: how should I interpret that 6,783,948 read pairs are duplicated 1 time? Does it mean they don't have a duplicate?
I also tried MarkDuplicates, but the number of duplicates I obtained from the two tools was quite different for the same sample; I guess the reason is the different algorithms behind them. Since I am interested in the distribution of the duplicates among my reads, I focused on the output of EstimateLibraryComplexity.
We are aiming to compare the observed distribution of duplicates among the read pairs with an expected distribution from a simulation study, and for that it is very important that I understand the output.
Thanks for all your help.
Jutta
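For the comparison with an expected distribution, the histogram can be normalised into an empirical distribution of duplication-group sizes (a sketch under the same interpretation as above; the expected probabilities would come from your simulation):

```python
# Histogram rows as posted (truncated at group size 16).
histogram = {1: 6783948, 2: 951889, 3: 133263, 4: 25143, 5: 7450, 6: 3095,
             7: 1569, 8: 915, 9: 532, 10: 380, 11: 260, 12: 164,
             13: 141, 14: 91, 15: 79, 16: 37}

n_groups = sum(histogram.values())
# Empirical probability that a randomly chosen duplication group has size k.
empirical = {k: n / n_groups for k, n in histogram.items()}
# Mean group size: read pairs examined per distinct read pair.
mean_group_size = sum(k * n for k, n in histogram.items()) / n_groups

print(empirical[1], mean_group_size)
```

Here the unique fraction empirical[1] comes out at about 0.86 and the mean group size at about 1.17; these are the observed quantities you would compare against the simulated distribution.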