How Picard MarkDuplicates identifies and removes duplicate reads

Picard's MarkDuplicates is a useful tool for removing duplicate reads. However, after reading the Picard documentation (https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates), I still have some questions about how duplicate reads are filtered out:

  1. The documentation distinguishes two types of duplicates: "library/PCR-generated duplicates (LB)" and "sequencing-platform artifact duplicates (SQ)". How does Picard tell LB duplicates from SQ duplicates based on the reads?

  2. With the default settings plus REMOVE_DUPLICATES=true, which type of duplicate reads is removed: SQ, LB, or both?

  3. Suppose reads A, B, and C are considered duplicates of each other and their quality scores are equal. If these reads map to the same position in the genome, which of them will Picard remove?
    And if these reads map to different positions in the genome, which of them will Picard remove?

  4. Following on from the previous question: if the quality scores of reads A, B, and C are not equal, which of them will Picard remove?

Thanks
