How to interpret output of EstimateLibraryComplexity

Jutta | Bonn, Germany | Member | Posts: 14

Dear all,

I am quite new to the analysis of RNA-seq data and have been searching the web for an answer on how to interpret the results output by EstimateLibraryComplexity. The only discussion I found was this one (https://www.biostars.org/p/103503/). However, it didn't sufficiently help me with my problem.

I am interested in the distribution of the duplicates in my RNA-seq dataset, which, if I understood correctly, is shown in the histogram output by EstimateLibraryComplexity.
This is my output:

METRICS CLASS picard.sam.DuplicationMetrics

LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown 0 9285609 0 0 0 1376414 156167 0.148231 31035324

HISTOGRAM java.lang.Integer

duplication_group_count Unknown
1 6783948
2 951889
3 133263
4 25143
5 7450
6 3095
7 1569
8 915
9 532
10 380
11 260
12 164
13 141
14 91
15 79
16 37
...

What does this histogram mean? Does the "1" in the first row and first column mean that there are 6783948 reads that have 1 duplicate, or are these 6783948 unique reads? Otherwise, how do I calculate the number of reads that do not have any duplicates?

Is it also possible to calculate the total number of reads examined or the total number of duplicates within this sample? If so, how?

Is it also possible to obtain the number of duplicates at each genomic position to which a read maps? Can I get this information from the histogram as well?

I hope it is clear what I am asking. I would be very happy if some of you could shed some light on this.

Greetings, Jutta

Answers

  • dekling | Broad Institute | Member, admin | Posts: 82
    edited July 2016

    @Jutta: The first column contains the number of duplicate sets, while the second column contains the number of read pairs from your BAM file with that number of duplications.

    For example, 133,263 read pairs are duplicated 3 times... and 380 read pairs are duplicated ten times.

    Unique read pairs (pairs without duplications) can be obtained by subtracting the sum of the READ_PAIR_DUPLICATES and READ_PAIR_OPTICAL_DUPLICATES from the number of READ_PAIRS_EXAMINED.

    The metrics section of the output file (METRICS CLASS picard.sam.DuplicationMetrics) has most of the information you are requesting. You might try the MarkDuplicates tool to get the rest. Let us know if this is still unclear.

  • Jutta | Bonn, Germany | Member | Posts: 14

    Dear Dekling,

    Thanks for your answer. I am not sure I understand it correctly yet. When you say that 133,263 read pairs are duplicated 3 times, does it mean I have 3 x 133,263 read pairs in my data set that are the same? Or 133,263/3 = 44,421 distinct read pairs, each of which is present 3 times?

    And what about the first row: how should I interpret that there are 6,783,948 read pairs that are duplicated 1 time? Does it mean they don't have a duplicate?

    I also tried MarkDuplicates, but the number of duplicates I obtained from the two tools was quite different for the same sample. I guess the reason is the different algorithms behind the two tools. Since I am interested in the distribution of the duplicates among my reads, I focused on the output of EstimateLibraryComplexity.

    We aim to compare the observed distribution of duplicates among the read pairs with the expected distribution from a simulation study, and for that it is very important to understand what I got.

    Thanks for all your help.
    Jutta
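
One way to test an interpretation of the histogram is to check it against the metrics row posted in the question. The sketch below is not a statement of how Picard computes the histogram. It assumes that each bin k counts distinct read-pair groups that were observed k times, and then recomputes the totals. The variable names and the `summarize` helper are made up for illustration; the numbers are copied from the post. Because the posted histogram is cut off after bin 16 ("..."), the sums are lower bounds. The duplication fraction uses the formula from Picard's DuplicationMetrics definitions, (UNPAIRED_READ_DUPLICATES + 2 * READ_PAIR_DUPLICATES) / (UNPAIRED_READS_EXAMINED + 2 * READ_PAIRS_EXAMINED).

```python
# Histogram rows as posted: bin -> count. The post elides the tail ("..."),
# so every total derived from this dict is a lower bound.
histogram = {
    1: 6783948, 2: 951889, 3: 133263, 4: 25143,
    5: 7450, 6: 3095, 7: 1569, 8: 915,
    9: 532, 10: 380, 11: 260, 12: 164,
    13: 141, 14: 91, 15: 79, 16: 37,
}

# Metrics row as posted.
UNPAIRED_READS_EXAMINED = 0
READ_PAIRS_EXAMINED = 9285609
UNPAIRED_READ_DUPLICATES = 0
READ_PAIR_DUPLICATES = 1376414


def summarize(hist):
    """ASSUMPTION: bin k holds the number of distinct groups seen k times."""
    groups = sum(hist.values())                      # distinct read pairs
    pairs = sum(k * n for k, n in hist.items())      # all read pairs
    duplicates = pairs - groups                      # extra copies beyond the first
    return groups, pairs, duplicates


groups, pairs, duplicates = summarize(histogram)

# Duplication fraction recomputed from the metrics row alone.
fraction = (UNPAIRED_READ_DUPLICATES + 2 * READ_PAIR_DUPLICATES) / (
    UNPAIRED_READS_EXAMINED + 2 * READ_PAIRS_EXAMINED
)

print("pairs seen exactly once:", histogram[1])
print("distinct groups (truncated):", groups)
print("read pairs (truncated):", pairs, "of", READ_PAIRS_EXAMINED)
print("duplicates (truncated):", duplicates, "of", READ_PAIR_DUPLICATES)
print("duplication fraction:", round(fraction, 6))
```

With the 16 posted bins, this gives 9,280,510 read pairs (5,099 short of READ_PAIRS_EXAMINED) and 1,371,554 duplicates (4,860 short of READ_PAIR_DUPLICATES). Both gaps are small enough to come from the elided tail, so the assumed reading fits the posted numbers. Under that reading, bin 3 would mean 133,263 distinct read pairs each present 3 times (3 x 133,263 = 399,789 read pairs in total), and bin 1 would be the read pairs with no duplicate at all. Note also that Picard's metric definitions describe READ_PAIR_OPTICAL_DUPLICATES as a subset of READ_PAIR_DUPLICATES, so the sketch does not add the two together.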
