Our documentation websites are currently offline due to a data center fire. We do not yet have an ETA for restoring service; we’ll update this message when we know more.

Read groups

edited February 23 in Dictionary

There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM /CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '^@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

Meaning of the read group fields required by GATK

  • ID = Read group identifier
    This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.
    Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since they are assumed to share the same error model.

  • PU = Platform Unit
    The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. Although the PU is not required by GATK but takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

  • SM = Sample
    The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.

  • PL = Platform/technology used to produce the read
    This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

  • LB = DNA preparation library identifier
    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.


Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portion of the read names that come after flow cell lane, and separated by colons, are tile number, x-coordinate of cluster and y-coordinate of cluster.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

Post edited by Sheila on

Comments

  • santiagorevalesantiagorevale ArgentinaMember

    Hi there,

    I hope you could help me unravelling this, because I always found confusing the definitions and examples for ID and PU.

    In the beginning of the document it says

    In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

    But then, in the middle we have

    ID = Read group identifier
    This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.

    According to the previous paragraph, two samples multiplexed in the same lane will have the same RG ID: flowcell + lane. Ergo, it won't be uniquely identifying each set of reads.

    On the other hand, we have the definition of PU as follows:

    PU = Platform Unit
    The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier.

    According to this definition, PU is described just as how ID was expected to be working: uniquely identifying each set of reads, flowcell + lane + barcode.

    Besides, the name of the RG Platform Unit sounds like it should be here where flowcell + lane should be specified, because it's actually referencing what a unit is for Illumina.

    So, shouldn't this two definitions be swapped/fix/amend?

    Finally, regarding the Multi-sample and multiplexed example, could you extend the example so that it contains several samples multiplexed together and sequenced in several lanes? In e.g.

    **Flowcell 1, Lanes 1-6**
    Dad 200 bp (barcode 1)
    Mom 200 bp (barcode 2)
    Kid 200 bp (barcode 3)
    
    **Flowcell 1, Lanes 7-8**
    Dad 400 bp (barcode 4)
    Mom 400 bp (barcode 5)
    Kid 400 bp (barcode 6)
    
    **Flowcell 2, Lanes 1-4**
    Dad 400 bp (barcode 4)
    Mom 400 bp (barcode 5)
    Kid 400 bp (barcode 6)
    

    In this example, barcodes 1, 2 and 3 are pooled together, and this same pool in then sequenced across 6 lanes. The same is done with barcodes 4, 5 and 6. I believe an example like this will for sure help me understand how this is intended to look like and what is it that I'm missing or overseeing.

    Thank you very much in advanced for your help.

    Best regards,
    Santiago

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @santiagorevale
    Hi Santiago,

    According to the previous paragraph, two samples multiplexed in the same lane will have the same RG ID: flowcell + lane. Ergo, it won't be uniquely identifying each set of reads.

    In this case, the SM tag will distinguish between the two different samples.

    The only extra information the PU field may contain is the barcode or index in a multiplexed run. The PU field is not required by GATK. It is true some information is redundant.

    Finally, regarding the Multi-sample and multiplexed example, could you extend the example so that it contains several samples multiplexed together and sequenced in several lanes?

    Can you post what you think would be the correct way to label the read group? :smile:

    Thanks,
    Sheila

Sign In or Register to comment.