
How to interpret the output of EstimateLibraryComplexity

Jutta · Bonn, Germany · Member

Dear all,

I am quite new to the analysis of RNA-seq data and was searching the web for an answer on how to interpret the results output by EstimateLibraryComplexity. The only discussion I found was this one (https://www.biostars.org/p/103503/), but it didn't sufficiently help with my problem.

I am interested in the distribution of duplicates in my RNA-seq dataset, which, if I understood correctly, is shown in the histogram produced by EstimateLibraryComplexity.
This is my output:

METRICS CLASS picard.sam.DuplicationMetrics

LIBRARY  UNPAIRED_READS_EXAMINED  READ_PAIRS_EXAMINED  SECONDARY_OR_SUPPLEMENTARY_RDS  UNMAPPED_READS  UNPAIRED_READ_DUPLICATES  READ_PAIR_DUPLICATES  READ_PAIR_OPTICAL_DUPLICATES  PERCENT_DUPLICATION  ESTIMATED_LIBRARY_SIZE
Unknown  0  9285609  0  0  0  1376414  156167  0.148231  31035324

HISTOGRAM java.lang.Integer

duplication_group_count Unknown
1 6783948
2 951889
3 133263
4 25143
5 7450
6 3095
7 1569
8 915
9 532
10 380
11 260
12 164
13 141
14 91
15 79
16 37

What does this histogram mean? Does the "1" in the first row, first column mean that there are 6,783,948 reads that have 1 duplicate, or are these 6,783,948 unique reads? Otherwise, how do I calculate the number of reads that do not have any duplicates?

Is it also possible to calculate the total number of reads examined, or the total number of duplicates, within this sample? If so, how?

Is it also somehow possible to obtain the number of duplicates at each position in the genome to which reads map? Do I get this information from the histogram as well?

I hope it is clear what I am asking. I would be very happy if some of you could shed some light into the darkness.

Greetings, Jutta


Issue · Github
by Sheila

Best Answer


  • dekling · Broad Institute · Member, admin
    edited July 2016

    @Jutta: The first column contains the duplication count, while the second column contains the number of read pairs from your BAM file with that number of duplications.

    For example, 133,263 read pairs are duplicated 3 times... and 380 read pairs are duplicated ten times.

    Unique read pairs (pairs without duplications) can be obtained by subtracting the sum of the READ_PAIR_DUPLICATES and READ_PAIR_OPTICAL_DUPLICATES from the number of READ_PAIRS_EXAMINED.

    The metrics section of the output file (METRICS CLASS picard.sam.DuplicationMetrics) has most of the information you are requesting. You might also try the MarkDuplicates tool to get the rest. Let us know if this is still unclear.
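The totals mentioned in this answer can be checked directly against the posted histogram. Below is a minimal sketch (not a Picard tool); the histogram values are copied from the question, and the reading that each row is "duplicate-set size, number of sets of that size" is an assumption, though the arithmetic below supports it:

```python
# Histogram from the EstimateLibraryComplexity output posted above:
# key = duplicate-set size, value = number of sets of that size (assumed reading).
hist = {
    1: 6783948, 2: 951889, 3: 133263, 4: 25143, 5: 7450, 6: 3095,
    7: 1569, 8: 915, 9: 532, 10: 380, 11: 260, 12: 164, 13: 141,
    14: 91, 15: 79, 16: 37,
}

# Read pairs with no duplicate at all: the sets of size 1.
unique_pairs = hist[1]                              # 6,783,948

# Distinct molecules (duplicate sets) = sum of the set counts.
distinct = sum(hist.values())                       # 7,908,956

# Total read pairs = sum over set sizes of (size x number of sets).
total_pairs = sum(k * n for k, n in hist.items())   # 9,280,510

# Duplicate read pairs = everything beyond the first copy of each set.
duplicates = total_pairs - distinct                 # 1,371,554

print(unique_pairs, distinct, total_pairs, duplicates)
```

Under this reading, total_pairs (9,280,510) lands just below READ_PAIRS_EXAMINED (9,285,609) and duplicates (1,371,554) just below READ_PAIR_DUPLICATES (1,376,414) from the metrics line, plausibly because the printed histogram stops at set size 16.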

  • Jutta · Bonn, Germany · Member

    Dear Dekling,

    Thanks for your answer. I am not sure I understand it correctly yet. When you say that 133,263 read pairs are duplicated 3 times, does that mean I have 3 × 133,263 read pairs in my dataset that are the same? Or 133,263 / 3 = 44,421 read pairs, of each of which three identical copies exist?

    And what about the first row: how should I interpret that there are 6,783,948 that are duplicated 1 time? Does it mean they don't have a duplicate?

    I also tried MarkDuplicates, but the number of duplicates I obtained from the two tools differed considerably for the same sample; I guess the reason is the different algorithms behind the two tools. Since I am interested in the distribution of duplicates among my reads, I focused on the output of EstimateLibraryComplexity.

    We are aiming to compare the observed distribution of duplicates among the read pairs with an expected distribution of duplicates in a simulation study, so it is very important that I understand what I got.
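For that kind of comparison, one simple baseline is a uniform-sampling model: if N read pairs are drawn uniformly at random from a library of L distinct molecules, each molecule's copy number is approximately Poisson(N / L). The sketch below is purely illustrative (it is not the model Picard uses to estimate library size); N and L are taken from the metrics line posted in the question:

```python
import math

# Assumption: N read pairs sampled uniformly from L distinct molecules,
# so per-molecule copy number ~ Poisson(lam = N / L).
N = 9_285_609     # READ_PAIRS_EXAMINED from the metrics output
L = 31_035_324    # ESTIMATED_LIBRARY_SIZE from the metrics output
lam = N / L

def expected_sets(k: int) -> float:
    """Expected number of duplicate sets of size k (molecules seen exactly k times)."""
    return L * math.exp(-lam) * lam ** k / math.factorial(k)

# Compare the model against the observed histogram for the first few set sizes.
observed = {1: 6783948, 2: 951889, 3: 133263, 4: 25143, 5: 7450}
for k, obs in observed.items():
    print(k, obs, round(expected_sets(k)))
```

The expected count for set size 1 comes out in the same ballpark as the observed 6.78 million, which is a quick sanity check that the histogram is being read the right way before building a fuller simulation.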

    Thanks for all your help.
