How to perform AddOrReplaceReadGroups

I have inherited whole-exome sequencing data for paired normal and tumor samples, with the aim of identifying somatic variants. I have already aligned the reads to hg38 and converted the SAM files to BAM files, and I would now like to run AddOrReplaceReadGroups. I do not know how many libraries were prepared. The samples were multiplexed, i.e. groups of 3 to 10 samples were run per lane of the flow cell. Would it be appropriate to assume that one library was prepared per sample?

Answers

  • shlee (Cambridge; Member, Broadie) ✭✭✭✭✭

    Hi @EleniStylianou,

    I think you'll find Article#6472 helpful. In particular, I updated this article with a section titled "Deriving ID and PU fields from read names", which shows how to break down Illumina sequencer read names for information on read groups.

  • Thank you for your answer, but I am unclear on how MarkDuplicates looks for duplicate reads. If I tell the algorithm that there is one library per sample per lane, rather than one library for multiple samples in multiple lanes, and this turns out to be wrong, does that create a problem?
    Thank you

  • shlee (Cambridge; Member, Broadie) ✭✭✭✭✭

    @EleniStylianou, you'll find details on MarkDuplicates in Article#6747. The article states:

    Appropriately assigned Read Group (RG) information. Read Group library (RGLB) information is factored for molecular duplicate detection. Optical duplicates are limited to those from the same RGID.
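
For anyone following along, the read-name approach mentioned in the first answer can be sketched roughly as below. This is only an illustration, not code from either article. It assumes the common seven-field Illumina read name layout (instrument:run:flowcell:lane:tile:x:y); other layouts would need different field positions. The function name `read_group_fields` is hypothetical, and the sample, library and barcode values are supplied by the caller, so "one library per sample" remains an assumption the user has to make explicitly. The dictionary keys match the AddOrReplaceReadGroups argument names (RGID, RGPU, RGSM, RGLB, RGPL).

```python
def read_group_fields(read_name, sample, library, barcode=None, platform="ILLUMINA"):
    """Derive read group fields from a seven-field Illumina read name.

    Assumed layout: instrument:run:flowcell:lane:tile:x:y
    (adjust the indices below if your read names differ).
    """
    # Drop a leading '@' (FASTQ) and any comment after the first space.
    name = read_name.strip().lstrip("@").split(" ")[0]
    parts = name.split(":")
    if len(parts) < 7:
        raise ValueError(f"unexpected read name layout: {read_name!r}")

    flowcell, lane = parts[2], parts[3]

    # ID: flowcell plus lane identifies the sequencing run unit.
    rgid = f"{flowcell}.{lane}"
    # PU: flowcell.lane, with the sample barcode appended when it is known.
    rgpu = f"{rgid}.{barcode}" if barcode else rgid

    return {
        "RGID": rgid,
        "RGPU": rgpu,
        "RGSM": sample,
        "RGLB": library,  # caller's assumption, e.g. one library per sample
        "RGPL": platform,
    }


# Hypothetical read name and sample labels, for illustration only.
fields = read_group_fields(
    "@A00123:45:HXXXXXXXX:3:1101:1000:2000 1:N:0:ACGT",
    sample="tumor1",
    library="tumor1_lib1",
    barcode="ACGT",
)
print(fields)
```

Because the quoted article says RGLB is factored into molecular duplicate detection and optical duplicates are limited to the same RGID, the values chosen here for `library` and for the ID are the ones that affect MarkDuplicates.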
