How to interpret output of EstimateLibraryComplexity
I am quite new to the analysis of RNA-seq data and have been searching the web for an answer about how to interpret the results output by EstimateLibraryComplexity. The only discussion I found was this one (https://www.biostars.org/p/103503/). However, it did not sufficiently help me with my problem.
I am interested in the distribution of the duplicates in my RNA-seq dataset, which, if I understood correctly, is shown in the histogram produced by EstimateLibraryComplexity.
This is my output:
METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown 0 9285609 0 0 0 1376414 156167 0.148231 31035324
What does this histogram mean? Does the "1" in the first row and first column mean that there are 6783948 reads that have 1 duplicate, or are these 6783948 unique reads? If they are not unique reads, how do I calculate the number of reads that do not have any duplicates?
Is it also possible to calculate the total number of reads examined or the total number of duplicates in this sample? If so, how?
Is it also possible to obtain the number of duplicates at each genomic position to which a read maps? Can I get this information from the histogram as well?
I hope it is clear what I am asking. I would be very happy if someone could shed some light on this.
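On the question of total reads and total duplicates, here is a minimal sketch of the arithmetic, using the values from the metrics row above. It assumes the usual Picard DuplicationMetrics convention that each read pair counts as two reads, which is consistent with the reported PERCENT_DUPLICATION of 0.148231. The function name `summarize_duplication` is made up for illustration and is not part of Picard.

```python
# Values copied from the DuplicationMetrics row in the post.
metrics = {
    "UNPAIRED_READS_EXAMINED": 0,
    "READ_PAIRS_EXAMINED": 9285609,
    "UNPAIRED_READ_DUPLICATES": 0,
    "READ_PAIR_DUPLICATES": 1376414,
}


def summarize_duplication(m):
    """Hypothetical helper: derive read-level totals from pair-level counts.

    Assumption: a read pair contributes two reads to both the examined
    total and the duplicate total.
    """
    total_reads = m["UNPAIRED_READS_EXAMINED"] + 2 * m["READ_PAIRS_EXAMINED"]
    duplicate_reads = m["UNPAIRED_READ_DUPLICATES"] + 2 * m["READ_PAIR_DUPLICATES"]
    return {
        "total_reads": total_reads,
        "duplicate_reads": duplicate_reads,
        "fraction_duplicated": duplicate_reads / total_reads,
    }


summary = summarize_duplication(metrics)
for name, value in summary.items():
    print(name, value)
```

If the assumption holds, the computed fraction should match the PERCENT_DUPLICATION column to six decimal places, which is a quick check that the totals are being read correctly.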