Read groups

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
edited May 3 in Dictionary

There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM /CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

Meaning of the read group fields required by GATK

  • ID = Read group identifier
    This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.
    Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since they are assumed to share the same error model.

  • PU = Platform Unit
    The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. Although the PU is not required by GATK but takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

  • SM = Sample
    The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name. Note, when we say pools, we mean samples that are not individually barcoded. In the case of multiplexing (often confused with pooling) where you know which reads come from each sample and you have simply run the samples together in one lane, you can keep the SM tag as the sample name and not the "pooled name".

  • PL = Platform/technology used to produce the read
    This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

  • LB = DNA preparation library identifier
    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.


Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portion of the read names that come after flow cell lane, and separated by colons, are tile number, x-coordinate of cluster and y-coordinate of cluster.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

Post edited by Sheila on

Comments

  • LuisaBLuisaB SwitzerlandMember

    Hi,

    sorry, but this is still not clear to me.

    My situation is the following: I have 475 individuals, sequenced in a unique run of a single instrument (same flowcell). On each lane of this run, 95 individuals were multiplexed (so I used 5 lanes). Each batch of 95 individuals all belong to the same library, so to be clear I have 5 libraries, each containing 95 individuals and each library was sequenced on a different lane of the same run (same flowcell).

    I added the read group information as follows.
    For example, for individual F008_01 belonging to library 1 and sequenced on lane 4:
    RGID=lane4
    RGPU=lane4
    RGSM=ind_F008_01
    RGPL=illumina
    RGLB=lib1

    For individual F033_01 belonging to library 2 and sequenced on lane 5:
    RGID=lane5
    RGPU=lane5
    RGSM=ind_F033_01
    RGPL=illumina
    RGLB=lib2

    Is this correct? Should I add also the name of the individual in ID or PU or is there a single read group for each lane, independently of how many individuals I have on a lane?

    The header of the reads in the fastq file looks like this for F008_01 (lane 4):
    @HWI-ST0747:467:C59MAACXX:4:1101:1979:2080 1:N:0:
    and like this for F033_01 (lane 5):
    @HWI-ST0747:467:C59MAACXX:5:1101:16528:2235 1:N:0:
    and the part before the lane information is the same for all lanes (@HWI-ST0747:467:C59MAACXX).
    Therefore, does it matter if I don't specify it neither in RGID nor in RGPU? It's like if all this was replaced by the word "lane" in the RG information I used. Is it clear what I mean?

    If the way I used is not correct, can you please tell me how I should specify the RG information in the correct way?

    Thank you very much,

    Luisa

    Issue · Github
    by Sheila

    Issue Number
    420
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • LuisaBLuisaB SwitzerlandMember

    Hi,

    any news about my question?

    Thanks,

    Luisa

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Sorry, this sort of question is assigned a low priority because we try to help people with blocking issues first, or issues that are specific to GATK. We'll make sure someone responds on Monday.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Luisa,

    Most of our team has gone away for the holiday break and I'm not the most knowledgeable about read groups, but here's my interpretation. If we look at the first example in the doc, we see the difference between ID and PU:

    @RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2 [...]
    

    If you consider that the core part of the read name is composed of three segments, you see that ID just contains the first and last segments, whereas PU contains all of it. That's because the middle segment is the barcode information that is specific to each multiplexed sample (for which at this point you should have separate files). In you case I think this would be broken down like this:

    @RG ID:C59MA.4  PL:illumina PU: C59MAACXX.4 [...]
    

    Within a single sample's file this looks redundant but if you were to combine these into multi-sample bams they would provide a useful differentiation per-sample for the purpose of base recalibration.

  • LuisaBLuisaB SwitzerlandMember
    edited January 2016

    Hi @Geraldine_VdAuwera,

    thank you for your answer.
    If I understood it correctly, what you mean is that the part that is present in PU but not in ID should be different in different samples, so PU contains more information than ID and this helps if I have multi-sample bam files.
    But the point is that in my case this part @HWI-ST0747:467:C59MAACXX is the same for all the samples on all the lanes I use. Only the number after this part changes. For example, in the read name I have @HWI-ST0747:467:C59MAACXX:4 if a sample was sequenced on lane 4, @HWI-ST0747:467:C59MAACXX:5 if a sample (the same sample or a different one) was sequenced on lane 5.
    So, I don't add any information adding a part that is the same for all my samples.
    Or have I misunderstood something?

    Thanks again,

    Luisa

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Luisa,

    You mentioned that 95 individuals were multiplexed on your 5 lanes. How do you distinguish the 95 individuals? Or are they pooled without separate barcoding?

  • LuisaBLuisaB SwitzerlandMember

    Hi @Geraldine_VdAuwera,

    yes, 95 individuals were multiplexed on each of the 5 lanes. I have single-end RAD-seq data for these individuals (sorry, I think I didn't specify this before).
    I can distinguish the 95 individuals because before the demultiplexing step, each read starts with a barcode, unique for each individual of the lane. It is an inline barcode, so it doesn't appear in the read name.

    Let me know if you need additional information.
    Thanks,

    Luisa

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I see -- then I would recommend using that barcode or something equivalent in place of the code that would normally be included in the read name.

  • jfbjfb Member
    edited January 2016

    Hi,
    I found your back-and-forth with Luisa helpful to my confusion about how to properly assign read groups to multiplexed samples on a nextseq (i.e., multiple samples are pooled into one library that are then spread across four lanes), but I still can't tell if you're suggesting to add the sample barcode into the ID field or keep it only in PU. The example with MOM,DAD,KID is nice but it might be even more helpful if it had some cases where there were multiple samples per lane. Also isn't the read definition you are using for determining PU vs ID a bit dated? The nextseq header I think is:
    @instrument:run_number:flowcell_ID:lane:tile:x-pos:y-pos read:is_filtered:control_number:index_sequence

    In my case libraries from tumor and normal from patient A are pooleed and loaded onto a nextseq. The tumorA fastq headers for lane 1are:
    @NS500126:394:HL5H3BGXX:1:11101:12343:1053 1:N:0:AACCAG
    and for normalA:
    @NS500126:394:HL5H3BGXX:1:11101:6442:1053 1:N:0:CACAGT

    So I think I should define
    ID=HL5H3BGXX.1 and PU=HL5H3BGXX.AACCAG.1and SM=tumorA
    ID=HL5H3BGXX.1 and PU=HL5H3BGXX.CACAGT.1 and SM=normalA

    Do I have this right?

    Thanks, John

    P.S. I was reading your other helpful post http://gatkforums.broadinstitute.org/gatk/discussion/3060/which-data-processing-steps-should-i-do-per-lane-vs-per-sample and trying to think ahead about what happens to these read groups when bams from the same sample (but different lanes) are merged prior to bqsr and indel realignment and then (as in my case of tumor/normal pairs) merged again. It seems like the first merge is fine but wont the RG:IDs clash during the tumor:normal merge?

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited January 2016

    Hi John,

    You CANNOT use your RG ID definitions. The @RG ID information field is the lowest denominator that distinguishes between samples and factors that contribute to technical artifacts, e.g. different flow cell lanes. The RG ID's field is applied to every read with the RG tag. For your definition, the ID fields are identical for your tumor and normal samples and when the files are merged for joint analysis, our tools are then unable to distinguish reads from the two different files.

    Here are my general recommendations:
    (1) If the @RG header information assigned by your sequencing provider allows you to analyze your datasets, then great. Don't mess with it.
    (2) If not, use Picard's AddOrReplaceReadGroups to replace your read group information fields so the @RG ID is unique per sample per lane.

    Some minor additional comments:
    - The @RG ID field is important up to HaplotypeCaller. After a VCF is produced, samples are distinguished by SM (sample name).
    - Because the RG tag is repeated for each read, the length of the tag can contribute to file size, e.g. SAM format files. If the files are compressed, e.g. BAM format, then this length should not matter as much.

    Thanks for the nextseq header information. I'm new to the team and this is the first time I've seen such a format. So I have a question for you in turn. I'm curious if each read name then also contains the read:is_filtered:control_number:index_sequence information. Or if each read queryname contains at least the index_sequence information. If you can clarify, we can keep this in mind for future documentation.

    Thanks, and I hope this clarifies your question.

    -Soo Hee

  • jfbjfb Member

    HI Soo Hee,

    Thanks for your help. So it sounds like I should instead define:
    ID=HL5H3BGXX.AACCAG.1and SM=tumorA
    ID=HL5H3BGXX.CACAGT.1 and SM=normalA
    and not worry about PU. I suppose for completeness I could put the machine and run info there:
    PU=NS500126.394
    but it's probably not necessary.

    I'm glad you mentioned that the length of the ID tag doesn't matter for BAM files...I was regretting defining such a long name.

    If I understand your question correctly, we parse the fastqs by sample barcode so the header I showed is the same for every read in the file (except of course for the tile:x-pos:y-pos fields). I believe this type of header is generic for Illumina since at least casava 1.8:
    http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm
    but I suppose the format is probably customizable.

  • shleeshlee CambridgeMember, Broadie, Moderator

    John,

    Your new RGID fields look good in that they now differentiate your samples.

    However, RECONSIDER your RGPU field as it has implications for BaseRecalibrator.

    Setting all your PU fields identical across samples has implications that you want to avoid. I recommend leaving the PU field at the original unique values (PU=HL5H3BGXX.AACCAG.1 & PU=HL5H3BGXX.CACAGT.1) over removing the PU field. This seems redundant given our tools also require unique RGID fields but is the best option for data processing and record keeping.

    The implications you want to avoid have to do with BaseRecalibrator. BaseRecalibrator uses the PU field if present and the PU field takes precedence over RGID as this user's post illustrates.

    Specifically, BaseRecalibrator uses four covariates in determining error models that it then uses to adjust base quality scores. One of these is the RGID or RGPU if present.

    Typical approach
    We multiplex samples across lanes and use the lane as a covariate in BaseRecalibrator. For example, my DNA, split across different lanes (marked by different RGIDs), would be aggregated and processed as one file through BaseRecalibrator with lane as a covariate. Your DNA would also be aggregated from across lanes and processed as one through BaseRecalibrator. In this approach, each sample is processed separately and is implicitly a covariate.

    Your approach
    All your samples were sequenced in the same flow cell lane, and so you are unconcerned with lane level covariation. You should still be concerned with sample level covariation. For example, what if one sample was degraded? Would you want to propagate the degraded sample's error models across all your samples?

    • If your RGPU is identical across your samples, then you effectively remove one of the covariates (sample) from consideration in the error models.
    • If your RGPU differentiates your samples, then BaseRecalibrator can consider samples (defined by RGPU) as a covariate.
    • If RGPU is absent, then BaseRecalibrator uses RGID (which already define samples per lane in your case) as a covariate.

    Apologies for the length of this response. Also, thank you for the CASAVA link. Please let me know if anything is unclear.

    -Soo Hee

  • jfbjfb Member

    OK thanks. I'll update my pipeline accordingly.

  • LuisaBLuisaB SwitzerlandMember
    edited January 2016

    Hi @shlee,

    thank you for your explanations.
    I still have a couple of questions about this topic though. As I wrote here above, I have single-end RAD-seq data for 475 individuals, sequenced in a unique run of a single instrument (same flowcell). On each lane of this flowcell, 95 individuals were multiplexed (so I used 5 lanes). In each batch, all 95 individuals were multiplexed in a single library. So, I have 5 libraries, each containing 95 individuals, and each library was sequenced on a single lane of the flowcell.

    Among these samples, I have one sample that was multiplexed in two different libraries and therefore sequenced on two different lanes (as a kind of replicate). The reads belonging to this sample are identified by a barcode for the first replicate (sequenced on lane 5), and by a different barcode for the other replicate (sequenced on lane 8). So, I would write this for the first replicate:

    RGID=C59MAACXX.TCGGCAGTCGTGCAG.5
    RGPU=C59MAACXX.TCGGCAGTCGTGCAG.5
    RGSM=F039_05
    RGLB=C59MAACXX.TCGGCAGTCGTGCAG.5
    RGPL=illumina

    and this for the second replicate:
    RGID=C59MAACXX.AAGACCAATCTGCAG.8
    RGPU=C59MAACXX.AAGACCAATCTGCAG.8
    RGSM=F039_05
    RGLB=C59MAACXX.AAGACCAATCTGCAG.8
    RGPL=illumina

    Is this ok, or the fact that I have two different barcodes for the same sample create problems? May I use the name of the individual (F039_05), instead of the barcode? I mean something like this for the first replicate:

    RGID=C59MAACXX.F039_05.5
    RGPU=C59MAACXX.F039_05.5
    RGSM=F039_05
    RGLB=C59MAACXX.F039_05.5
    RGPL=illumina

    and this for the second replicate:
    RGID=C59MAACXX.F039_05.8
    RGPU=C59MAACXX.F039_05.8
    RGSM=F039_05
    RGLB=C59MAACXX.F039_05.8
    RGPL=illumina

    I'm asking this also because the barcode does not appear in my read headers, that look like this:
    @HWI-ST0747:467:C59MAACXX:5:1101:16528:2235 1:N:0:
    so I am not sure it is ok to use it in the RG information.
    In addition to this, the same barcodes are used in different lanes to multiplex different individuals (e.g. barcode CTAACACGGCTGCAG is used for sample F008_02 in lane 4, and for sample F033_02 in lane 5).

    One last question: regarding RGLB, is it ok to write the same value used for ID and PU, or should RGLB be the same for the 95 individuals multiplexed in the same library? e.g. something like C59MAACXX.5 for all the individuals multiplexed in the library sequenced on lane 5?

    Sorry for the long post and thank you for your help.
    Please let me know if something I wrote is unclear.

    Luisa

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited January 2016

    Hi @LuisaB,

    First question about how to label your single technical/biological replicate
    Let me ask you a question in turn. What is your reason for including this technical/biological replicate in your experimental dataset? What are you trying to control for? Depending on your answer to this question, I would conjecture how you use the RG fields would differ. The only thing I'd like to point out is that the RGSM field is what distinguishes samples in the VCF, and your two technical/biological replicates, by virtue of their identical RGSMs, would become indistinguishable. I'm going to guess, based on my own unconfirmed assumptions, that for your aims this is undesirable. If this is so, we'd suggest adding .5 and .8 to your replicate sample names to distinguish them.

    Second question about RGLB usage
    The dictionary entry above states:

    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

    This means that MarkDuplicates is a tool that is aware of the RGLB field.

    However, my limited knowledge of RAD-Seq has me remembering it is more like a targeted amplification library where you'd expect mostly duplicates, even more so given the single ended nature of your reads. In this situation, marking duplicates isn't so relevant. So in this context, I would fathom how you want to use the RGLB field is up to you. For other downstream implications of the RGLB designations, I have to defer to @Geraldine_VdAuwera and @Sheila.

    Until one of them can get back to you or I can check with them, I will also refer you to two posts : (1) an overview of lane, library, sample and cohort and (2) a discussion of how MarkDuplicates handles library information.

    I hope this is helpful.

    -Soo Hee

    Post edited by shlee on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @shlee is correct that the RGLB is mostly used for marking duplicates and related QC operations (eg estimating library complexity). Downstream GATK tools do not use RGLB. BaseRecalibrator uses RGID or RGPU if present; all other GATK tools only use RGSM.

    Regarding technical replicates, how to treat them depends on the intended purpose of the replication, as @shlee also correctly pointed out. Some people use technical replicates to evaluate statistical properties of their analysis; for that purpose you should label them with different RGSMs. Others use them to accumulate more data per sample; for that purpose you would use the same RGSMs. It's up to you and your experimental design.

  • LuisaBLuisaB SwitzerlandMember
    edited January 2016

    Hi @shlee and @Geraldine_VdAuwera,

    thank you for your replies.

    Regarding RGLB, you are correct, I cannot use MarkDuplicates with my data. So I will set RGLB=null for all my samples.

    Regarding the replicates, I will not use them as "real" replicates. I had an empty spot and I decided to sequence this individual twice. So I will keep the same value for RGSM, because I want them to be identified as single individual during the SNP calling. But still, I think it is important that BQSR manages to distinguish them, since they were sequenced on different lanes and therefore they will show a different error model. So, what I wanted to know in my previous post is if it is correct to include the barcodes in RGID and RGPU, also if two different barcodes identify the same individual in the two lanes, or if I should rather use directly the name of the individual (F039_05 - as specified in my previous post):

    RGID=C59MAACXX.TCGGCAGTCGTGCAG.5
    RGPU=C59MAACXX.TCGGCAGTCGTGCAG.5
    RGSM=F039_05
    and
    RGID=C59MAACXX.AAGACCAATCTGCAG.8
    RGPU=C59MAACXX.AAGACCAATCTGCAG.8
    RGSM=F039_05

    or rather:
    RGID=C59MAACXX.F039_05.5
    RGPU=C59MAACXX.F039_05.5
    RGSM=F039_05
    and
    RGID=C59MAACXX.F039_05.8
    RGPU=C59MAACXX.F039_05.8
    RGSM=F039_05

    also because, as I wrote before, the barcode does not appear in my read headers, and the same barcodes are used in different lanes to multiplex different individuals (e.g. barcode CTAACACGGCTGCAG is used for sample F008_02 in lane 4, and for sample F033_02 in lane 5).
    So maybe it would be easier to just use the names of the individuals also in ID and PU (e.g. C59MAACXX.F039_05.5) rather than the barcodes (e.g. C59MAACXX.TCGGCAGTCGTGCAG.5). Is it possible to do this or do I really need the barcodes?

    Thanks again,

    Luisa

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I wouldn't recommend using "null" for any required field. That is a reserved word in some programs and may be misinterpreted as a missing value by some. Call it Charlie if you want, just not null or unknown or special words like that.

    Ultimately you can use whatever you want in the RG fields as long as they disambiguate the origin of the reads appropriately. We just provide guidelines based on the conventions we use but there are few to no absolute naming rules.

  • LuisaBLuisaB SwitzerlandMember

    Ok, I will not use "null".
    Thanks,

    Luisa

  • myprogramming2016myprogramming2016 CanadaMember
    edited March 2016

    Hi,
    I need some help for adding read group to sorted bam files.
    I have more than 100 sorted bam files. I would like to use picard tools to add read group.
    Could you please share a command or script top add read group in one go?

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @myprogramming2016
    Hi,

    Hopefully someone will share on here. We do not provide such scripts. You can also ask on some other forums as well.

    -Sheila

  • benjaminpelissiebenjaminpelissie Madison, WIMember

    Hi,

    This is a very helpful thread, but I still encounter some doubts about my analyses.

    I have 88 samples, paired-end sequenced on Illumina Hiseq, that were multiplexed on the 8 lanes of a same run (i.e., in the same flowcell, giving 10 samples per lane, with tags re-used in different lanes). Here is the first line of the fastq files for sample CPBWGS_02:
    @HWI-D00256:413:C7N5GANXX:1:1101:1112:2084 1:N:0:ATTACTCGATAGAGGC
    @HWI-D00256:413:C7N5GANXX:1:1101:1112:2084 2:N:0:ATTACTCGATAGAGGC

    Among those 88 samples, 10 were re-sequenced (due to quality issues) in 1 lane of another run (i.e., in another flowcell), using the same multiplexing tags as used in the first run. Here is the first line of the fastq file for sample CPBWGS_02_v2 (which is a re-sequencing of CPBWGS_02):
    @HWI-D00256:427:HCHMJBCXX:1:1101:1224:2149 1:N:0:ATTACTCGATAGAGGC
    @HWI-D00256:427:HCHMJBCXX:1:1101:1224:2149 2:N:0:ATTACTCGATAGAGGC

    My understanding is that 1. I have to input the tag name in ID (rather than just cell and lane infos) in order to deal with the multiplexing, and 2. I should structure my data per lane in SM, in order to allow an accurate detection of duplicates down the road. Am I right?

    Currently, I am defining:

    • for CPBWGS_02:
      ID=C7N5GANXX.ATTACTCGATAGAGGC.1
      SM=CPBWGS_02
      LB=C7N5GANXX.1

    • for CPBWGS_02_v2:
      ID=HCHMJBCXX.ATTACTCGATAGAGGC.1
      SM=CPBWGS_02 # I want those reads to be pooled for variant calling
      LB=HCHMJBCXX.1

    Would that be ok?

    As for PU, I was thinking about not defining it (since it would be entirely redundant with ID for all my samples), but @shlee replied to John that it would be problematic for his analyses. Is it relevant in my case?

    Many thanks,
    Ben

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @benjaminpelissie. We haven't forgotten you and will get back to you sometime soon about your question.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited May 2016

    Hi @benjaminpelissie,

    I hope I can help you with your questions. Others from my team may jump in with thoughts.

    Before delving into your questions, a preamble:

    The term multiplexing could be used at two levels. (A) The risk-spreading type, e.g. multiplexing 8 samples into a single library that is run across 8 different lanes. And (B) the cost-saving type, e.g. 8 samples run in a single lane. Both cases require we barcode each sample so that we can identify which reads belong to which samples. When we demultiplex or demux, we use the barcodes to separate out each sample's reads into individual files.

    As shown in the trio example, a sample can have multiple libraries, e.g. SM=Dad has 200bp insert and 400bp insert libraries (2 different LB labels). So SM>LB>ID or SM>LB>ID>PU for BQSR. Our tools take the lowest denominator read group component into consideration, typically ID. MarkDuplicates additionally factors for LB (to flag molecular duplicates) and BQSR takes PU into consideration, if present, over ID.

    Now some responses to your questions:

    I have 88 samples, paired-end sequenced on Illumina Hiseq, that were multiplexed on the 8 lanes of a same run (i.e., in the same flowcell, giving 10 samples per lane, with tags re-used in different lanes).

    From your description--11 samples per lane for 88 samples--it appears option (B) applies to your scenario. Let me ask, what was the reason you had to re-sequence a lane's worth of samples? Were these samples all from the same lane in the initial run? Meaning, are their quality issues associated with the lane they shared or with the samples themselves?

    Among those 88 samples, 10 were re-sequenced (due to quality issues) in 1 lane of another run (i.e., in another flowcell), using the same multiplexing tags as used in the first run.

    Let me ask another question. Did you reload the same library prep of the samples as the first run for the resequencing or did you remake the sample libraries using the same barcodes? Depending on this and assuming you will still analyze your "bad" data (because you say "I want those reads to be pooled for variant calling"), how you label LB will differ and has impact on marking duplicates. If they come from the same library prep (the same tube you handed to the sequencing center) and assuming you are salvaging the bad lane of data along side the resequenced data, then you'd want to make sure LBs are identical for these Jekyll-and-Hyde sets. This is because MarkDuplicates will consider molecular duplicates for identical LBs despite differing IDs. Optical duplicates are detected per ID group. If they did not come from the same prep (you remade your libraries), then the LB fields should be different.

    My understanding is that 1. I have to input the tag name in ID (rather than just cell and lane infos) in order to deal with the multiplexing, and 2. I should structure my data per lane in SM, in order to allow an accurate detection of duplicates down the road. Am I right?

    Yes, you want to differentiate your samples in the same lane. So using the barcode tag as part of the ID is one means to this end. Sample (SM) should refer to the individual you are sequencing, e.g. dad, for whom you want variant calls to be represented in a single column in the VCF. The SM field is not a factor in duplicate flagging as I mentioned above.

    Currently, I am defining:

    for CPBWGS_02:
    ID=C7N5GANXX.ATTACTCGATAGAGGC.1
    SM=CPBWGS_02
    LB=C7N5GANXX.1
    
    for CPBWGS_02_v2:
    ID=HCHMJBCXX.ATTACTCGATAGAGGC.1
    SM=CPBWGS_02 # I want those reads to be pooled for variant calling
    LB=HCHMJBCXX.1
    

    Would that be ok?

    If the library was identical, change your LB fields to be identical. Otherwise, keep them different. Yes, keep your SM fields identical as is. Your ID fields are already different--they distinguish your Jekyll-and-Hyde runs. So this is good.

    As for PU, I was thinking about not defining it (since it would be entirely redundant with ID for all my samples), but @shlee replied to John that it would be problematic for his analyses. Is it relevant in my case?

    I don't remember all the thought that went into John's case. What is clear is that your cases are different. It should be fine for you to omit PU and stick with your IDs.

    I'd like to add that, the way I understand BQSR, be sure to provide it all the data from a single lane and run it separately for each lane. This means you'll be running 9 instances of BQSR and your Jekyll-and-Hyde sets are run separately. This approach catches and helps to alleviate lane-level issues that can seriously impact the quality of data, such as those caused by under or overloading a lane.

    This got really long. Again, I hope this is helpful. Let us know if anything is unclear or if you have more questions.

  • benjaminpelissiebenjaminpelissie Madison, WIMember

    Thank you @shlee, that really helps. Because our sequencing center should have just reloaded the same library preps to resequence those 10 individuals, I guess should have given my Jekyll-and-Hyde samples the same LB value. However, I am now at a step where I mapped all my samples to a reference and was about to mark duplicates. Can I just run AddOrReplaceReadGroups on the mapped bam files before marking duplicates, or should I start over and re-create uBAM from scratch?

  • shleeshlee CambridgeMember, Broadie, Moderator

    @benjaminpelissie You can still change the LB labels as you describe (run AddOrReplaceReadGroups on the mapped bam), since MarkDuplicates is the only tool that should care about it. No need to start from scratch. That being said, it occurs to me you have an opportunity to see what are some differences between your Jekyll-and-Hyde sets, at least at the duplicate and library complexity level by running independent MarkDuplicates runs by providing the files separately to MarkDuplicates. After this, you can flag duplicates again for the LB grouping by feeding both Jekyll and Hyde files into a MarkDuplicates run. You or your sequencing center probably have already done or considered such things.

  • benjaminpelissiebenjaminpelissie Madison, WIMember

    This is a good news! Thank you @shlee. Yes I did consider to mark duplicates at the lane then at the sample (SM) level, although I am realizing that I wouldn't have been able to do it with different LBs for my Jekyll-and-Hyde sets... so thanks again!

  • benjaminpelissiebenjaminpelissie Madison, WIMember

    I noted that after modifying RGLB and adding RGPU to my mapped bam, the output files created by AddOrReplaceReadGroups were exactly the same size as input files OR exactly 4 bytes heavier (5 cases out of 11). I don't know what it means but this is clearly a pattern.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Is it a pattern that you're worried about? That sounds like just a harmless artifact of compression.
  • RDWRDW Member

    @Geraldine_VdAuwera said:

    • PU = Platform Unit
      The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. Although the PU is not required by GATK but takes precedence over ID for base recalibration if it is present.

    I was a bit puzzled by this explanation, which seems to contradict what I've read elsewhere about the definition of the PU field. My impression was that this is intended to differentiate (in the case of Illumina data) between flowcell lanes and not to say anything about the sample or library, which are already described in the SM and LB tags. It would thus be in a form like {FLOWCELL_BARCODE}.{LANE} and not {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. If this interpretation is correct, adding {SAMPLE_BARCODE} to PU would actually make it too specific, since there's no longer a simple way for downstream tools to (e.g.) operate at the flowcell lane level regardless of the sample or library. In fact, doesn't BQSR operate per lane, and might not this be the reason why it prefers PU (when available) to the RG ID?

    According to the SAM/BAM spec:

    "PU: Platform unit (e.g. flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifer."

    which (as the name 'platform unit' suggests) does not mention sample or library. My interpretation of 'Unique identifier' in this context would be that it refers to the uniquely identified flowcell lane, and not to what is run on it.

    Is this interpretation just wrong, or is the explanation above not quite correct, or does GATK use the PU tag in a non-standard way? Apologies if I'm completely off-base here, or missing something obvious!

    Example - I have a data set with multiple samples, each with a single library, multiplexed within a lane but in some cases also run on multiple lanes to increase the depth of coverage. Armed only with the SAM/BAM spec I would probably have constructed the RG ID for each unique combination of sample, fllowcell and lane in the form {SAMPLE}.{FLOWCELL}.{LANE}, the corresponding PU tag as {FLOWCELL}.{LANE}, the SM tag as {SAMPLE} and the LB as {SAMPLE}.lib. My reading of Geraldine's response to Luisa suggests I may have the contents of the ID and PU tags the wrong way around, but Soo Hee's response to John (and my reading of the SAM spec) suggests that the RG ID must indeed be unique and take into account the identity of the sample, while (as above) it isn't obvious to me that the sample belongs in the PU tag. As far as GATK is concerned this may be a moot point once the samples have been demultiplexed into individual files, but I'd rather use all tags in a standard way if possible to avoid any potential issues with third party tools.

    Grateful for any comments!

    Richard.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Richard @RDW,

    I should clarify that the definition of PU given in this article is in fact a fairly narrow definition that corresponds specifically to our usage of PU in our pre-processing pipeline. You're correct that the definition given in the spec is more general and could be interpreted quite differently. I believe it was left intentionally vague because to leave some flexibility in analysis design and platform compatibility.

  • RDWRDW Member

    Hi Geraldine,

    Thanks for your quick response!

    To me, it seems like this 'narrow' definition of PU essentially re-purposes it for a slightly different use, a side effect of which is that it can no longer be used as a straightforward lane identifier, i.e., including a sample identifier in PU seems redundant and makes it less useful for some purposes. It's kind of a shame that PU is not both mandatory and restricted to {flowcell}.{lane}, which would avoid the issue described in the Biostars post below, discussing an Appistry support query on lane identification:

    https://www.biostars.org/p/90317/#92174

    But perhaps there is a good reason for defining PU this way, e.g. to create an identifier that can't be re-written when merging BAMs, etc.

    That aside, I'd like to make absolutely sure I'm using the tags in the way that GATK actually expects, so in my multiplex example above, can you confirm that it is OK to set both ID and PU to {SAMPLE}.{FLOWCELL}.{LANE}?

  • benjaminpelissiebenjaminpelissie Madison, WIMember

    Hi Geraldine. No I didn't worry especially about this pattern, though didn't think of the compression either... Thanks!

  • shleeshlee CambridgeMember, Broadie, Moderator

    @RDW When MarkDuplicates merges BAMs with identical RGIDs, it check the other RG fields, e.g. LB, and if there are differences, it appends a .1 to the identical RGID to differentiate (.2 for the third file etc). So I wouldn't say that RGID is re-written when merging. Rather, the RGID gets appended. This updated RGID is perpetuated to each read record's RG tag.

    Here I'm going out on a limb, and someone may correct me if I'm off-track, but I could imagine some poor sap with not enough data per lane to run BQSR. Their samples were split too many ways across multiple lanes. Let's say theirs was two of many samples per lane, and they ran four lanes of the same two samples (2x4=8 RGIDs). In this case, could this person sacrifice individual sample identities and run four PUs (each representing one of the four lanes) through one instance of BQSR? Just thinking out loud. If the alternative is to forgo BQSR, what is preferable?

  • RDWRDW Member

    Thanks for the clarification on the MarkDuplicates behaviour. Regarding multiplexing, I may be the poor sap here - I have some data with a dozen samples multiplexed per lane, but with samples run on 4 different lanes - i.e., 12 read groups per lane, 4 read groups per sample (don't ask!). How would you suggest I (a) write the ID and PU fields for this type of data and (b) perform BQSR? The discussion in the Biostars link suggests that Appistry have recommended running lane-level BQSR of multiplexed samples in some situations, but it seems I can't just use the PU tag to identify a lane (or can I?).

  • shleeshlee CambridgeMember, Broadie, Moderator

    I have issues with some of the statements made on the Biostars post you've shared. Instead of responding to those statements, let me instead clarify that BQSR controls for lane level errors that are specific to each sequencing run. Another important thing to remember is that BQSR needs enough data to perform well. The original BQSR document says the following:

    - The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
    
    - A critical determinant of the quality of the recalibation is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results. 
    

    This thread adds to this that the minimum of 100M total bases refers to targeted bases x coverage and should exclude duplicates. There are of course other read filters that BaseRecalibrator uses but typically duplicates would have the most impact.

    I think you are asking whether you should run (A) 12 instances of BQSR per sample (4 RGIDs per BQSR run), (B) four instances of BQSR per lane (12 RGIDs per BQSR run) or (C) one instance of BQSR per aggregated lane (4 PUs per BQSR run). We recommend (A) or (B). I think the hypothetical scenerio (C) may be better than nothing but is less preferable than the other two options. I think typical runs these days, especially for variant calling, should allow you enough data to run either scenario (A) or (B). Is this not the case for you? For both (A) and (B), if PU is present, it should distinguish the ID groupings. Again, to reiterate, if PU is present, BQSR will use PU over ID as the sample-level covariate. How you fill these fields is up to you.

  • SumuduSumudu Sri LankaMember

    Hi,

    Thank you for all the informative content here.
    I just want to make sure that my understanding for defining RGLB is correct. 8 individual DNA samples were pooled together and sequenced in a single lane in a single run on illumina MiSeq.

    In this case a single LB name can be used for all 8 samples. Below is the @RG tags I used. Fastq file header for sample L23334_S2 is: @M04362:21:000000000-AU567:1:1101:13786:1609 1:N:0:2

    RGID : 000000000-AU567.2.1
    RGPU : 000000000-AU567.2.1
    SM : L23334_S2
    LB : Leish_lib

    Please confirm whether this is correct.

    Next question is, if I prepared two sets each with 4 multiplexed samples (from the 8 samples I have) & run on two different lanes, are these considered as 2 different libraries?

    What if those two sets run on same lane? Should they be considered same library or as two different?

    Appreciate your clarification.
    Thank you.
    Sumudu.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Sumudu
    Hi Sumudu,

    Sorry for the late response. Can you confirm that the individual samples are not barcoded so you can tell the difference in individuals?

    Thanks,
    Sheila

  • SumuduSumudu Sri LankaMember

    @Sheila

    Hi,
    For reads of each sample a unique combination of Nextera index sequences (I7 and I5) were added before pooling for sequencing. After sequencing I received de-multiplexed samples, for each R1 and R2 separate datasets.

    I hope this is the information you are asking.

    Thanks
    Sumudu

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Sumudu
    Hi Sumudu,

    I have some general recommendations for you.

    1) The different insert (500 bp insert and 700 bp insert) libraries from the same sample should have different libraries.
    2) Library should reflect sample prep, not lane runs. Lane run differences should be reflected in the RGID.

    I hope this helps.

    -Sheila

  • SumuduSumudu Sri LankaMember
    edited October 2016

    @Sheila

    Thank you Sheila. Its clear now.

    I have done 300bp insert 8 different sample preps. Run them on same lane in a flowcell after pooling. They should be given unique RGLB tags as well.

    Regards
    Sumudu.

  • JuttaJutta Bonn, GermanyMember

    Hey there,
    I followed this discussion, however I couldn't answer all of my questions I have concerning how to determine the read group informations for Illumina RNA-seq data. And I hope you can give me some help!

    I have in total 180 RNA-seq samples, of which 20 samples each were pooled and distributed on 9 lanes of 2 flowcells.

    Based on these data I would like to do SNP calling according to the best practice guide of GATK. I mapped the reads to the reference genome using TopHat2. However, when I would like to see the read group @RG information, I don't get any... (samtools view -H accepted_hits.bam | grep '@RG' source: https://software.broadinstitute.org/gatk/guide/article?id=6472). So I want to add them my self using picard's AddOrReplaceReadGroups.

    The Name of the Fasta file I obtained from our sequencing facility is:
    FCH5Y5TBBXX_L4_HKRDMAIuejTADFRAAPEI-212
    And the header of this file is:
    @K00114:299:H5y5TBBXX:4:1101:1884:1138

    Based on these information I tried to figure out what are my read group information for this sample. According to the description and definition above, I concluded that:
    RGLP: ILLUMINA
    RGID: H5Y5TBBXX.4 (Flowcell + lane name/number)
    RGPU: ? (Flowcell + lane + sample barcode)
    RGSM: samplename1
    RGLB: ?

    For RGPU and RGLB I am not sure how to obtain the right information?
    As RGPU is defined as {FLOWCELL_BARCODE}.{LANE}_{SAMPLE_BARCODE} I am not sure where to find the barcode of the sample in the Name of my fasta or the header of my fasta file?
    And second, where can I find the information for the Library identifier LB? I didn't do the library preparation by my self. Can I get this information somewhere?

    I would be really happy if you could help me solve those questions! Thank you very much in advance,

    Best regards,
    Jutta

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Jutta
    Hi Jutta,

    Can you confirm what you mean by "pooled"? Do you mean you don't know which sample is which? Or, are the samples individually barcoded, and you know which reads come from each sample?

    Thanks,
    Sheila

  • JuttaJutta Bonn, GermanyMember

    Hi Sheila,

    Sorry, I guess I mean "multiplexed". My samples are individually barcoded, so I know which reads come from which sample.

    Thanks,
    Jutta

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Jutta
    Hi Jutta,

    Your PL, SM, and ID look fine. For the SM, since you know the individual samples, you will need to make sure the samples are labeled by their names.

    To get the PU (GATK does not require PU), you need to look at the read names in the BAM file, not the FASTA file. Have a look under "Deriving ID and PU fields from read names" in the article.

    The LB tag should group the reads that come from the same sample and were processed together in the library preparation stage. For example, in the article example, the DAD has two LBs (LIB-DAD-1 and LIB-DAD-2). Those distinguish the reads with different insert sizes.

    I hope this helps.

    -Sheila

  • JuttaJutta Bonn, GermanyMember

    Yes, thank you very much Sheila

  • MeliMeli SwedenMember

    This may be a naive question, but do the ID and PU have to match parts of the header in the BAM file, or is it okay to phrases like "flowcell1.lane1.sample25" as long as you keep track of which samples are in the same flowcell/lane?
    Thanks!

  • shleeshlee CambridgeMember, Broadie, Moderator

    @Meli, You may use your alphanumeric phrases to label the fields.

  • Guillaume_DGuillaume_D Lausanne, SwitzerlandMember

    Hi,

    I read all the posts of this page and what to put in each read groups appear quite clear to me now (thanks!) but those discussions also rose some new questions to me.

    Let me first try to explain my quite complicated sequencing design. I have 90 individuals, 6 individuals/lane and each individuals is sequenced in 3 different lanes (1 individuals = 1 library), so finally I have 45 lanes.

    I will add the corrects read groups for each ind of each lane (so 90*3=270 RGID) during the alignment step (bwa-mem in 270 instences).
    Then I have to run MarkDuplicates. If I understood correctly the good way to do it is to merge the 3 files of a same individual and run MarkDuplicates per individual (because 1 individuals = 1 library), so 90 instances of MarkDuplicates, is it right ?
    After that, I want to run the BQSR and it's still not clear to me how to do it (after reading this page and this one : http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr).
    Should I run it also per individual (so 90 instances of BQSR), per lane (so 45 instances of BQSR) or per lane per individual (so 270 instances of BQSR) ?

    Thanks by advance !
    Guillaume

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Hi Guillaume, you are correct on all points.

    When you run MarkDuplicates on the 3 readgroups of an individual, it will write out a single bam, and you can run BQSR on that bam for that individual. So you will have 90 instances of BQSR.
  • Guillaume_DGuillaume_D Lausanne, SwitzerlandMember

    Hi Geraldine,

    Great, I will do that ! Thank you very much !

    Guillaume

  • ibseqibseq United KingdomMember

    Hi,
    if I have an already sorted BAM file from bwa-mem, can I use it directly for markduplicates provided the validatesamfile doesnt give errors? or do I need to go through the sam file first and add the read group at that stage, sort and then mark duplicates and reconvert to bam and sort again?

    my bam needs @rg, and i though to add it instead of going back to generate the sam file. will this be ok?
    thanks,
    ibseq

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ibseq
    Hi ibseq,

    You can run AddOrReplaceReadGroups on the BAM file straight from BWA. No need to convert to SAM.

    -Sheila

  • aborabor SwitzerlandMember
    edited March 2017

    Hi
    I have two bam files from the same individual (i.e. the same individual was sequenced twice). Does it make sense to merge the 2 bam files? Does it bring any advantage in terms of coverage? Could it cause problems in the subsequent haplotypecalling and variant calling steps?

    Also I wanted to ask if the following pipeline is correct: I process separately Sample1.bam and Sample2.bam (bwa --> AddOrReplaceReadGroups). Once this is done I use PrintReads to merge Sample1.recal.bam and Sample2.recal.bam, followed by AddOrReplaceReadGroups again to change the tags. Then, on Sample12merged.bam: MarkDuplicates --> Indel realignment --> BQSR. Then I go for the Haplotype calling.

    Thank you very much for your help.

    Post edited by abor on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @abor
    Hi,

    I think this article will help.

    -Sheila

  • Hello,

    This forum was really helpful for understanding ReadGroup, but I still have some questions. Your answers will be really appreciated.

    Question 1:

    read group IDs are composed using the flowcell + lane name and number
    The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}

    Is "FLOWCELL_BARCODE" in PU different from "flowcell" in readgroup IDs? I assume both represent identical information which is "FlowCell identification". Is it correct?

    Question 2:

    I have two BAMs (I would like to merge) from a same sample (which means a same DNA tube I provided to a sequencing center), but they have different barcode sequences (which means they were sequenced separately or on different times). In this case, I think the two BAMs should have different LB fields (i.e. Sample.Barcode1 and Sample.Barcode2). Is this correct?

    Thank you in advance,

    Hoon

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @wisekh
    Hi Hoon,

    1) The flowcell is the same in both PU and ID.

    2) Yes, that is correct.

    -Sheila

  • cjaln1994cjaln1994 MelbourneMember

    Hi,
    Thanks for all the useful discussion above. However I'm new to the field & still unsure about my particular RG information.
    My situation:
    I have 10 replicates of pooled, RNAseq data for each of two samples (10 x Sample A, 10 x Sample B).
    By pooled data I mean each replicate has mRNA from multiple individuals all mixed together. Replicates were multiplexed across two lanes, so each replicate has a separate Lane 3 and Lane 4 bam file. I was going to use the following format:

    RGID=UniqueReplicateBarcode.LaneNumber
    RGLB=UniqueReplicateBarcode
    RGSM=SampleA (as opposed to SampleB)
    RGPL=Illumina

    I am concerned about setting information correctly for the BQSR stage (I'm not marking dupes due to nature of study). I feel like I am not feeding in enough information about having .bam files from different replicates that were run on the same lane! Does BQSR care about grouping reads based on multiplexing samples into the same lanes? I also considered adding an RGPU field identical to what I have for RGID:

    RGPU=UniqueReplicateBarcode.LaneNumber

    Thanks in advance for any advice,
    Chris

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @cjaln1994
    Hi Chris,

    Sorry for the delay. I am checking with the team and will get back to you asap.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @cjaln1994
    Hi Chris,

    I am assuming some things here, so please correct me if I am wrong. You have pooled data for two samples. The pooled data consists of reads from 10 replicates (either biological or technical) that will be analyzed as one single sample. You do know which read comes from which replicate (barcoded reads).

    If that is correct, your information looks fine except for the RGID. You should simply include the flowcell and lane, not the unique replicate barcode.

    If you know which reads come from each individual, you can mark duplicates. We recommend not marking duplicates if you do not know which individual the reads come from. MarkDuplicates will take into account the read group information (specifically LB information).

    BQSR works on reads that come form the same lane. The lane is reflected in the RGID, which is why you should include the flowcell and lane. Including the extra information will cause the reads from the same lane but different libraries to be processed separately. You want to process the lane data together in BQSR. Have a look at this thread for more information.

    -Sheila

  • cjaln1994cjaln1994 MelbourneMember

    Hi @Sheila,
    Thanks so much, very helpful! I'll go back and tweak the read group information. I have posted another question about HaplotypeCaller in a fresh discussion to keep this thread on topic :blush:
    Chris

  • cdomsarcdomsar SpainMember

    Hi,

    Thank you so much for the discussion above. I have read through it and found it very useful, you have a great support team.

    I am experiencing a trouble with my bams, however, since I don't know exactly how should I call my LB groups. I am mapping the raw reads (.fastq files) of the whole genome of a species into the exome of a related species. I am doing this in order to better calculate a nonsynonymous/synonymous mutation ratio in those coding sequences. I want to compare whether mapping these reads with a single-end treatment or paired-end treatment against the exome would be better in order to avoid chimeric alignments and duplicates. I have two paired-end libraries in four .fastq files. One of them has 100-bp-long and the other 250-bp-long inserts (I provide the info of the first read):

    FCC0U47ACXX_7_1 (info: @FCC0U47ACXX:7:1101:1141:1978#AAGTCTCT/1)
    FCC0U47ACXX_7_2 (info: @FCC0U47ACXX:7:1101:1141:1978#AAGTCTCT/2)
    FCC0U61ACXX_5_1 (info: @FCC0U61ACXX:5:1101:1903:2155#AAGTCTCT/1)
    FCC0U61ACXX_5_2 (info: @FCC0U61ACXX:5:1101:1903:2155#AAGTCTCT/2)

    Mapping them with bwa as SE gives me 4 files (FCC0U47ACXX_7_1.bam, FCC0U47ACXX_7_2.bam, FCC0U61ACXX_5_1.bam and FCC0U61ACXX_5_2.bam) and 2 files as PE (FCC0U47ACXX_7.bam and FCC0U61ACXX_5.bam).

    As far as I understand the example with the DAD/MOM/KID libraries, from each sample, two libraries were constructed, and each library was run in a separate lane. What if both lanes of each library were paired-end reads? Let's say, ID:FLOWCELL1.LANE1 had forward reads, and ID:FLOWCELL1.LANE2 had reverse reads. In both cases, we are working with the same library (LIB-DAD-1) and this is why we use the same LB.

    Would it make sense to add these @RG groups to my bams?
    @RG ID:FCC0U47ACXX_7_1 PL:ILLUMINA LB:FCC0U47ACXX_7 SM:SAMPLE1 PI:100
    @RG ID:FCC0U47ACXX_7_2 PL:ILLUMINA LB:FCC0U47ACXX_7 SM:SAMPLE1 PI:100
    @RG ID:FCC0U61ACXX_5_1 PL:ILLUMINA LB:FCC0U61ACXX_5 SM:SAMPLE1 PI:250
    @RG ID:FCC0U61ACXX_5_2 PL:ILLUMINA LB:FCC0U61ACXX_5 SM:SAMPLE1 PI:250

    Let's say I consider that FCC0U47ACXX_7_1 and FCC0U47ACXX_7_2 belong to the same library. If I have overlapping between the forward and reverse reads, wouldn't MarkDuplicates get rid of the second sequence since it would be considered a duplicate?

    Thanks for the support and sorry if the question seems naïve, I am still on my learning curve.

    best regards

    Carlos

  • cdomsarcdomsar SpainMember

    In this case, I am using this command for the first library:

    java -Xmx8G -jar /opt/picard-tools/picard.jar AddOrReplaceReadGroups ID=FCC0U47ACXX_7_1 PL=ILLUMINA LB=FCC0U47ACXX_7 SM=SAMPLE1 PI=100 PU=none SORT_ORDER=coordinate INPUT=FCC0U_7_1.sorted.bam OUTPUT=FCC0U_7_1.sorted.rg.bam

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @cdomsar
    Hi Carlos,

    As far as I understand the example with the DAD/MOM/KID libraries, from each sample, two libraries were constructed, and each library was run in a separate lane. What if both lanes of each library were paired-end reads? Let's say, ID:FLOWCELL1.LANE1 had forward reads, and ID:FLOWCELL1.LANE2 had reverse reads. In both cases, we are working with the same library (LIB-DAD-1) and this is why we use the same LB.

    Yes, the library preparation being the same for those reads means the LB field is the same. The different lanes are reflected in the ID field.

    Would it make sense to add these @RG groups to my bams?

    @RG ID:FCC0U47ACXX_7_1 PL:ILLUMINA LB:FCC0U47ACXX_7 SM:SAMPLE1 PI:100
    @RG ID:FCC0U47ACXX_7_2 PL:ILLUMINA LB:FCC0U47ACXX_7 SM:SAMPLE1 PI:100
    @RG ID:FCC0U61ACXX_5_1 PL:ILLUMINA LB:FCC0U61ACXX_5 SM:SAMPLE1 PI:250
    @RG ID:FCC0U61ACXX_5_2 PL:ILLUMINA LB:FCC0U61ACXX_5 SM:SAMPLE1 PI:250

    Those look correct to me :smile:

    Let's say I consider that FCC0U47ACXX_7_1 and FCC0U47ACXX_7_2 belong to the same library. If I have overlapping between the forward and reverse reads, wouldn't MarkDuplicates get rid of the second sequence since it would be considered a duplicate?

    I don't think MarkDuplicates will remove those reads because it takes into account the start position. Those reads would not have the same start position. Have a look at the MarkDuplicates slide deck in the presentations section for more information.

    -Sheila

  • ErruErru FinlandMember

    Hi,
    I'm having problems with the platform. I have new data from BGISEQ sequencing machine and the program gave an error with that so I changed to 'UNKNOWN' as it was given as a valid option in the error message. The problem is unknown is also giving the error and validateSam gives:

    HISTOGRAM java.lang.String

    Error Type Count
    ERROR:INVALID_PLATFORM_VALUE 1

    So is the unknown a valid options or do I have to claim the data to be something it really isn't to get this working?

    My error from BQSR is:
    ERROR MESSAGE: The platform (UNKNOWN) associated with read group GATKSAMReadGroupRecord @RG:483 is not a recognized platform. Allowable options are ILLUMINA,SLX,SOLEXA,SOLID,454,LS454,COMPLETE,PACBIO,IONTORRENT,CAPILLARY,HELICOS,UNKNOWN

    and my GATK version:
    The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Erru
    Hi,

    Hmm. Have you tried using a different value other than UNKNOWN? Does that still throw an error? Can you post an entire read group record here?

    Thanks,
    Sheila

  • ErruErru FinlandMember

    ILLUMINA as platform has worked well before, others I haven't tested.

    I had to change the read groups with picard due to using BGISEQ there when mapping the read. My picard command was for each sample like:

    picard AddOrReplaceReadGroups I=dedup_318.R1.sorted.bam O=dedup_318.sorted.bam RGID=318 RGLB=318 RGPL=UNKNOWN RGPU=unit1 RGSM=318
    (this is how you use picard in my current computing environment )

    Unfortunately BGISEQ doesn't seem to give lane or other info in the fastq files, that's why ID, sample and LB are all the same value.

    Issue · Github
    by Sheila

    Issue Number
    2786
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Erru
    Hi,

    I will have to check with the team why UNKNOWN does not work. In the meantime, I think you can stick to using ILLUMINA as PL even though it is not true. As long as the PL field is the same for the read groups, you should be fine.

    -Sheila

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @Erru,

    I suspect the UNKNOWN option is some artifact of autogenerating the tool documents from the code. It is something the tool's code uses as a value when a field's value does not match any expected value but not a true option available to you. Please use a different value and let us know how it goes.

  • I thought from Erru's message above from December 8th that "COMPLETE" is allowed for RGPL (read group platform), but Picard ValidateSamFile fails to recognize it. I have Complete Genomics data and would like to use an appropriate platform name.

    Is that a discrepancy between Picard and GATK? If so, can I ignore the error from Picard ValidateSamFile and assume that GATK is okay with COMPLETE as the RGPL?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @danfulop,

    Can you tell us which version of Picard is giving you the error with COMPLETE? And can you clarify whether you've tried using it in the PL field with GATK? Thanks.

  • Hi,

    I have read through all the comments and indeed they are very helpful to define the correct @RG for my data. I think I know how to correctly define it but I want to get an expert opinion.

    Experiment:
    I have several tumor samples sequenced on Novaseq. For each sample I get four pairs of fastq files corresponding to four lanes. In the attached image I have described the file name, read names and @RG tags derived from read-names. Can you please help to verify these? Based on your recommendation, I am planning to write a script to automatically fetch @RG information from the read names.

    I have following questions:

    1. As per my understanding @RG-ID should be unique to fastq files from each lane of each run. I have assigned combination of (flowcell.barcode.lane) which is unique for each run/lane/sample. Is this assignment correct?

    2. The @RG-PU field is optional but it gets preference in BQSR. Since, I am assigning same value to ID and PU, can I skip assigning the PU? If it is skipped, will BQSR and other tools will work correctly based on current @RG-ID?

    3. In initial run, samples were using a primary library. But because of some good reasons the remainder of the same library was subject to a number of amplifications. In later runs, amplified library was sequenced. In such case, should I use the same or different library ID for the amplified runs?

    4. For two samples, we have received lower than expected coverage and we will perform additional sequencing by preparing new library. For this data, I will use same @RG-ID convention derived from read_IDs, add different library ID and same sample ID. Is this correct?

    Thank you and I appreciate your time for answering this questions.

  • In above question, I have confusion because for each samples one library is loaded on multiple runs and each run have 4 lanes of data. Therefore, should I add:

    Separate RGID to each run/each lane (i.e flowcell.barcode.lane combination as shown in image above)

    OR

    Separate RGID only to each lane because same library is sequenced across multiple runs (i.e combination of barcode.lane)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sutturka
    Hi,

    1/2)Your read names look fine. Note, the PU is not required, so if you want, you can remove the barcode from the ID and leave the PU out entirely. Your way is correct too.

    3) You should use a different library for the newly amplified library.

    4) That is correct.

    -Sheila

  • Note, the PU is not required, so if you want, you can remove the barcode from the ID and leave the PU out entirely. Your way is correct too.

    I think barcode is necessary here. If I exclude the barcode, then RG-ID for the tumor_1 and tumor_98 samples becomes identical. So even with using RG-ID as (flowcell.barcode.lane), I can still exclude the PU without having any impact on BQSR?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sutturka
    Hi,

    I think the SM tag will distinguish between the two samples even if the RGID is the same. However, it may be more convenient to keep what you have (including the barcode).

    Yes, you can exclude the PU field, as GATK does not require it. But, if you keep it, PU will take precedence in BQSR (without it BQSR will take into account ID).

    -Sheila

  • freekfreek Member
    edited May 8

    Hi GATK team,

    Thank you for this page, great and insightful discussion.

    After reading everything though I still have a question. In my lab we use an Illumina NextSeq. When loading the library, one puts the entire sample onto a single chip, the sequencer then fills all 4 lanes of this chip with the same library. Thus, after de-multiplexing one can have a fastq file from 1 sample that comes from multiple lanes. The fastq snippets below are from the same fastq file:

    @NB501997:54:HTFJ3BGX3:1:11101:2980:1037 2:N:0:GGCTAC
    
    @NB501997:54:HTFJ3BGX3:4:23612:4923:20392 2:N:0:GGCTAC
    

    You see that the lanes number differs. This means I cannot add the ID to the read group information when mapping using BWA (bwa mem -M -R $RG ...), as it adds 1 single ID per output bam file. So my question is, how do I add this information of lane per read back to the bam file? Or, perhaps, I should treat a NextSeq chip as a single lane and simply use the chip ID as the read group ID because the lanes are connected anyway?

    Highest regards,

    Freek van Hemert.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @freek
    Hi Freek,

    how do I add this information of lane per read back to the bam file?

    The ID captures the lane information. I think you are asking how to change the read group information? If so, you can use AddOrReplaceReadGroups.

    -Sheila

  • freekfreek Member
    edited May 14

    Hi @Sheile,

    Thank you for your response. I was talking about the ID in the Read Group information. I am aware of AddOrReplaceReadGroups, but that program still adds one Read Group line per BAM file and assumes one sample per lane (lane being the smallest unit of technical variance.) But our NextSeq instrument has 4 connected lanes so it produces up to 16 fastq file pairs from all those four lanes combined. I then map them into a BAM file which has the Read Group ID as per your examples: "Chip.lane". However that would mean every single read in the bam would need it's own Read Group because reads from the same sample come from different lanes. The headers from my post above are from one single fastq file and one single sample and one single chip, but from multiple lanes... How would the Read Group information look like for such a sample? I feel the best thing to do is leave out the lanes? Or, I should split the lanes into different BAM files (somehow)?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @freek,

    You have a sample sequenced on four different sequencer lanes that were then melded together into the same FASTQ file.

    Lane level information is important to preserve for base quality score recalibration (BQSR), a step in the Preprocessing Best Practices.

    I believe in our pipelines, lane-level data is separated in to different files. These are processed by IlluminaBasecallsToSam, which labels each file with a unique read group. Given your data is already in FASTQ format, the equivalent for you is to run FastqToSam. However, so far as I know, this tool expects each file to represent one read group. So you will have to separate out your FASTQ reads into four lane-level files. This should be fairly straight-forward given what I've outlined in the Deriving ID and PU fields from read names section of the above document. I had added this section to update the document when I first started with GATK.

    AFAIK, GATK and Picard cannot perform such a demultiplexing of reads. However, perhaps you can search BioStars or SeqAnswers for other programs that can. Alternatively, you could use grep '@NB501997:54:HTFJ3BGX3:1:' and grep @NB501997:54:HTFJ3BGX3:4: to sort the reads with Unix. I hope this is helpful.

  • freekfreek Member

    Hi @shlee,

    Thank you this is a very useful answer and the one I feared a bit :smile:

    There is still one thing I want to make very explicit though before I'm going to change our pipeline this much and so early in the process (possibly even the base calling itself which we do now with Illumina's own bcl2fastq).

    I have the feeling that at broad you usually use sequencing machines with a much higher throughput than our NextSeq500. For example the HiSeq system, which takes a chip with 6 separately loadable lanes. The NextSeq500 however takes smaller chips with 4 lanes, these lanes cannot be loaded separately, the lanes are effectively 1 chamber, filled with 1 library. knowing this, you would still split the reads by lane before doing the base recalibration steps?

    Highest regards,

    Freek

  • shleeshlee CambridgeMember, Broadie, Moderator

    @freek, given

    The NextSeq500 however takes smaller chips with 4 lanes, these lanes cannot be loaded separately, the lanes are effectively 1 chamber, filled with 1 library.

    then it seems to me that this chip is effectively a single lane, subject to the same sequencing chemistry. Are the tile coordinates unique to each of the lanes or reused per lane? This is an important consideration. If the former, then I think you could consider treating the entire chip as one lane. If the latter, then I think you need to treat the lanes as separate read groups. I would have to check if GATK/Picard tool features that use tile coordinates (BQSR and MarkDuplicates) take only the RG/PU and read name tile coordinates or if the lane information of the read name is also a consideration. If the former, then definitely you must treat the lanes as separate read groups. If the latter, then you could possibly get away with the lanes being one read group.

    I have to say, I think you miss out on leveraging the advantages of true multiplexing. For your NextSeq500, this would, for example, look like barcoded library A mixed with barcoded library B run on two different 4-chamber chips resulting in two different runs.

  • freekfreek Member
    edited May 29

    @shlee, Thank you for your answer, it is now clear. I will analyze per lane, just to be sure, and merge the resulting bam files later on. My experiment now looks like: (4 samples, 4 lanes)

    • bwa mem per lane per sample (make 16 bams)
    • Then sort the bams
    • Merge the bams using markDuplicates
    • Sort the bams again (this is needed right?)
    • Run BQSR on the merged bam (which contains reads from a single sample over multiple lanes now but with correct Read Groups so BQSR will be performed correctly)
    • Initial variant calling,

    Highest regards.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @freek,

    You may find this somewhat outdated reference implementation useful. In the illustrated pipeline, you see that newer versions of MarkDuplicates takes in query-name sorted alignments so there is no need for you to sort your BAMs between BWA MEM and MarkDuplicates. After MarkDuplicates merging, sort and fix the tags. The tag fixing tool is now SetNmMdAndUqTags. You can find updated implementations of the preprocessing pipelines at https://github.com/gatk-workflows/.

  • freekfreek Member
    edited May 30

    Hi @shlee,

    The new wdl seem better readable to me and it fixes some t arguments that weren't working in my shell script. I'm still wonder though about the wgs_coverage_interval_list which I don't see defined anywhere (some other files I also can't find), I have asked about this here as well but didn't get a response yet. Can you help me?

    Freek.

  • freekfreek Member

    @shlee
    I did as you adviced and skip sorting after mapping to immdediatly go for merging using MarkDuplicates but I do get an error:

    java.lang.IllegalArgumentException: Alignments added out of order in SAMFileWriterImpl.addAlignment for file:///rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.aligned.unsorted.duplicates_marked.bam. Sort order is queryname. Offending records are at [NB501997:16:H37NMBGX3:1:11101:8782:1041] and [NB501997:16:H37NMBGX3:1:11101:25283:1042]
    

    My command is:

    echo -e "### GATK MarkDuplicates on BAM file\n" >> \$log
    echo -e "This step also merges the lanes specific BAM files into a single BAM file\n" >> \$log
    started
    gatk MarkDuplicates \\
        \$markDuplicatesInput \\
        --ASSUME_SORT_ORDER queryname \\
        -O \${bamBaseName}.aligned.unsorted.duplicates_marked.bam \\
        -M \${bamBaseName}.marked_dup_metrics.txt
    finished
    

    where $markDuplicatesInput is a list:

    --INPUT /rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.L1.unsorted.bam --INPUT /rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.L2.unsorted.bam --INPUT /rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.L3.unsorted.bam --INPUT /rst1/2017-0205_illuminaseq/scratch/swo-359/analyzedNoSplitLane/170711_NB501997_0016_AH37NMBGX3/R0016S1/R0016S1.L4.unsorted.bam
    

    I will sort them for now, that means I can skip SetNmMdAndUqTags them? I will re-sort after the merge.

    Highest regards,

    Freek.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @freek,

    Alignments added out of order...Sort order is queryname. Offending records are at [NB501997:16:H37NMBGX3:1:11101:8782:1041] and [NB501997:16:H37NMBGX3:1:11101:25283:1042]

    It appears that your inputs are not queryname-sorted in a manner that MarkDuplicates expects. When you ran your alignment, did you provide BWA queryname sorted FASTQs? BWA outputs exactly in the same order as it reads in reads. It sounds like you found a solution that works. If you are interested in revisiting this error, perhaps try single inputs to MarkDuplicates.

    From your other thread that you link, it looks like you are finding the cloud resources on your own with some digging. I see:

    gsutil ls gs://broad-references/hg38/v0/wgs*
    gs://broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list
    gs://broad-references/hg38/v0/wgs_calling_regions.v1.interval_list
    gs://broad-references/hg38/v0/wgs_coverage_regions.hg38.interval_list
    gs://broad-references/hg38/v0/wgs_evaluation_regions.hg38.interval_list
    gs://broad-references/hg38/v0/wgs_metrics_intervals.interval_list
    

    And so I suspect one of these corresponds to the wgs_coverage_interval_list you refer to but has been renamed. Just be sure to match exactly the usage found in the pipelined scripts and JSON inputs. If in doubt, be sure to look inside the interval_list to make sure you have what you expect.

  • freekfreek Member
    edited June 1

    Hi @shlee

    I answered the last remark in the other thread to stay on topic :smile:

    Freek.

    Post edited by freek on
  • Hi @shlee ,

    Thank you for this helpful discussion. After reading the above questions and responses, I still have two questions about defining read groups.

    1). How do you define 'lane name', which is part of the definition given for ID:
    "read group IDs are composed using the flowcell + 'lane name' and number, making them a globally unique identifier across all sequencing data in the world."

    In Illumina data, I don't see 'lane name' defined, just the lane number is defined:
    @'instrument':'run number':'flow cell id':'lane number':'tile':'x_coord':'y_coord':'read':'isfiltered':'control number':'sample number'

    What do you recommend using for 'lane name'?

    1. Based on the thread discussion, I think I understand that for cases of multiplexing where uniquely-barcoded libraries (corresponding to unique biological samples) are sequenced together in a single lane to save cost, all libraries would belong to the same read group and thus share the same RGID.

    HOWEVER, GATK staff recommends defining a separate read group for each biological sample for purposes of downstream processes.

    In my case, I have 8 paired-end cDNA libraries representing 8 unique biological samples (one library per biological sample) multiplexed in a single Illumina lane.

    Here is an example read name:
    Read1 (F/R), Individual 1
    @J00160:59:HG5Y5BBXX:8:1101:26058:1244 1:N:0:ACATTGGC
    @J00160:59:HG5Y5BBXX:8:1101:26058:1244 2:N:0:ACATTGGC

    Read1(F/R), Individual 2
    @J00160:59:HG5Y5BBXX:8:1101:26118:1244 1:N:0:ATGCCTAA
    @J00160:59:HG5Y5BBXX:8:1101:26118:1244 2:N:0:ATGCCTAA

    With some uncertainty, since I don't understand what 'lane name' means (see question one), I would define ID as 'HG5Y5BBXX8', which is 'flow cel id''lane number' for both individuals following the original article's definition.

    However, the two individual would have different PUs:
    Individual 1
    HG5Y5BBXX8ACATTGGC
    Individual 2
    HG5Y5BBXX8ATGCCTAA

    BUT, based on your recommendations in the above discussion, you would suggest my differentiating the IDs, also. Because of this recommendation, should I just use the PU (above) for the ID field, too?

    In that case:
    Individual 1
    ID: HG5Y5BBXX8ACATTGGC
    PU: HG5Y5BBXX8ACATTGGC

    Individual 2
    ID: HG5Y5BBXX8ATGCCTAA
    PU: HG5Y5BBXX8ATGCCTAA

    Thank you very much for any assistance you can offer.

    CarolineJudy

  • shleeshlee CambridgeMember, Broadie, Moderator

    Absolutely @CarolineJudy, you can use the unique PUs for your RGIDs.

Sign In or Register to comment.