
Understanding Read Groups

Dear All,

I am new to GATK and very confused about "Read Groups". I have FASTQ files and would like to add read groups to them. Here is what I have understood; please let me know where I am right or wrong.

For the sake of simplicity, let's say I have 2 samples, S1 and S2, and 3 lanes. This gives 6 FASTQ files, i.e. S1_L1.fastq.gz, S1_L2.fastq.gz, S1_L3.fastq.gz, S2_L1.fastq.gz, S2_L2.fastq.gz and S2_L3.fastq.gz

Now, the SAM manual (http://samtools.github.io/hts-specs/SAMv1.pdf) says: "Each @RG line must have a unique ID".
The GATK documentation (https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups) says something similar:
"ID = Read group identifier This tag identifies which read group each read belongs to, so each read group's ID must be unique."

Hence, the read groups should be something like this:
RGID=S1_L1 PL=illumina SM=S1, RGID=S1_L2 PL=illumina SM=S1, RGID=S1_L3 PL=illumina SM=S1
RGID=S2_L1 PL=illumina SM=S2, RGID=S2_L2 PL=illumina SM=S2, RGID=S2_L3 PL=illumina SM=S2
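
As a rough illustration of the sample-plus-lane naming described above, the sketch below derives one @RG header line per FASTQ file. The helper name `read_group_line` and the file-name pattern are assumptions made for this example; they are not part of GATK or Picard, and a real pipeline would normally also set LB and PU.

```python
import re

# File names following the S<sample>_L<lane> pattern used in the post.
FASTQS = [
    "S1_L1.fastq.gz", "S1_L2.fastq.gz", "S1_L3.fastq.gz",
    "S2_L1.fastq.gz", "S2_L2.fastq.gz", "S2_L3.fastq.gz",
]

def read_group_line(fastq_name, platform="illumina"):
    """Build a SAM @RG header line from a name like 'S1_L2.fastq.gz'.

    Hypothetical helper: it assumes the '<sample>_<lane>.fastq.gz' layout.
    """
    match = re.fullmatch(r"(?P<sample>[^_]+)_(?P<lane>[^.]+)\.fastq\.gz", fastq_name)
    if match is None:
        raise ValueError(f"unexpected file name: {fastq_name}")
    sample, lane = match["sample"], match["lane"]
    # ID combines sample and lane, so it is unique across all six files;
    # SM carries the sample name only, so lanes of one sample share it.
    fields = [f"ID:{sample}_{lane}", f"PL:{platform}", f"SM:{sample}"]
    return "@RG\t" + "\t".join(fields)

rg_lines = [read_group_line(name) for name in FASTQS]
for line in rg_lines:
    print(line)
```

Each printed line has a distinct ID, while the three lanes of a sample share one SM value.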

This should result in something similar to what has been mentioned by GATK for multiplexed data

Dad's data:

Mom's data:

In my understanding, the idea behind using the same SM name is to group data by sample, irrespective of the ID, which can differ because of lanes etc.

Now, I thought "ID" is used to check for technical differences between lanes, for example to check whether the sequencing in one of four lanes was faulty and produced more PCR duplicates than the rest. How will this be possible when each ID is different?

This gets even more confusing when I read the following discussion "https://gatkforums.broadinstitute.org/gatk/discussion/2078/how-read-groups-affect-variant-calling?"

In that thread, the following was given as the right approach:
RGID=lane1 SM=case1-normal LB=nolib PL=illumina RGID=lane1 SM=case1-tumor1 LB=nolib PL=illumina ....... RGID=lane2 SM=case2-tumor2 LB=nolib PL=illumina
As can be seen, RGID takes only two values, lane1 and lane2.

Can you kindly let me know which approach is correct, and why?
What is RGID actually used for? Knowing that would help me make an informed decision.

Also, will this stay the same in GATK4?

Waiting for reply :smile:



  • Sheila, Broad Institute (Member, Broadie)


    Your read groups are correct.

    For the two pre-processing steps MarkDuplicates and BQSR, the documentation says:

    ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since they are assumed to share the same error model.

    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

    In addition to that, the SM tag is also taken into account. In the case of MarkDuplicates, "nolib" works in that example because the samples all have different names.

    I hope that helps. And, yes, this is the same in GATK4 :smile:
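
To make the distinction in the reply above concrete, here is a toy sketch of how the same six read groups fall into different partitions depending on which tags are used as the key. It only illustrates the grouping idea; it is not how GATK or MarkDuplicates is implemented, and the LB values (`lib1`, `lib2`) and the helper `group_ids_by` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical read groups: one library per sample, sequenced on three lanes.
READ_GROUPS = [
    {"ID": "S1_L1", "SM": "S1", "LB": "lib1"},
    {"ID": "S1_L2", "SM": "S1", "LB": "lib1"},
    {"ID": "S1_L3", "SM": "S1", "LB": "lib1"},
    {"ID": "S2_L1", "SM": "S2", "LB": "lib2"},
    {"ID": "S2_L2", "SM": "S2", "LB": "lib2"},
    {"ID": "S2_L3", "SM": "S2", "LB": "lib2"},
]

def group_ids_by(read_groups, *tags):
    """Collect read group IDs under the tuple of values of the given tags."""
    groups = defaultdict(list)
    for rg in read_groups:
        key = tuple(rg[tag] for tag in tags)
        groups[key].append(rg["ID"])
    return dict(groups)

# Keyed by ID: every read group stands alone (the view the reply
# describes for base quality score recalibration).
per_read_group = group_ids_by(READ_GROUPS, "ID")

# Keyed by sample and library: lanes of the same library are pooled
# (the view the reply describes for duplicate marking).
per_library = group_ids_by(READ_GROUPS, "SM", "LB")

print(len(per_read_group))
print(per_library)
```

With unique IDs, the per-ID view keeps six separate groups, while the sample-and-library view pools the three lanes of each sample.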


  • So, according to your reply, I am right to assign read groups with unique RGIDs like this:
    RGID=S1_L1 PL=illumina SM=S1
    RGID=S1_L2 PL=illumina SM=S1
    RGID=S1_L3 PL=illumina SM=S1
    RGID=S2_L1 PL=illumina SM=S2
    RGID=S2_L2 PL=illumina SM=S2
    RGID=S2_L3 PL=illumina SM=S2

    Then why did you write the following in the forum thread https://gatkforums.broadinstitute.org/gatk/discussion/2078/how-read-groups-affect-variant-calling? "You want to label read groups that came from the same lane with the same name, because that is how BQSR distinguishes errors based on lanes. BQSR requires lots of read information and will perform more accurately if it has lots of reads from the same lane so it can determine lane bias. If you label read groups from the same lane with different names, BQSR will not know they have the same bias and will be less accurate". That would make the read groups look like this:
    RGID=L1 PL=illumina SM=S1
    RGID=L2 PL=illumina SM=S1
    RGID=L3 PL=illumina SM=S1
    RGID=L1 PL=illumina SM=S2
    RGID=L2 PL=illumina SM=S2
    RGID=L3 PL=illumina SM=S2

    It would be helpful if you or someone else could clarify my doubts in not too technical language.
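
One way to compare the two labelling schemes contrasted above is to check each against the uniqueness rule quoted from the SAM manual ("Each @RG line must have a unique ID"). The sketch below is a minimal check under the assumption that all six read groups would end up in the same header; `duplicated_ids` is a hypothetical helper, not a GATK function.

```python
from collections import Counter

SAMPLES = ("S1", "S2")
LANES = ("L1", "L2", "L3")

def duplicated_ids(read_groups):
    """Return the sorted IDs that occur in more than one (ID, SM) pair."""
    counts = Counter(rg_id for rg_id, _sample in read_groups)
    return sorted(rg_id for rg_id, n in counts.items() if n > 1)

# Scheme 1: ID combines sample and lane, e.g. S1_L1.
sample_lane_scheme = [(f"{s}_{l}", s) for s in SAMPLES for l in LANES]

# Scheme 2: ID is the lane only, e.g. L1, reused for both samples.
lane_only_scheme = [(l, s) for s in SAMPLES for l in LANES]

print(duplicated_ids(sample_lane_scheme))  # []
print(duplicated_ids(lane_only_scheme))    # ['L1', 'L2', 'L3']
```

Under that assumption, only the sample-plus-lane scheme satisfies the unique-ID rule; the lane-only scheme reuses every ID once per sample.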

  • Sheila, Broad Institute (Member, Broadie)


    I think the issue is that the docs are not quite synced up. The most important thing to note is that ID is the lowest denominator that differentiates factors contributing to technical batch effects. So, you need to include the sample name in there.

    The other statements in there are for different tools and how they use ID. Those are still correct, but when you take into account all tools, sample name does need to be a part of the RGID. I hope this makes sense. I will see if the docs can be made more clear.


  • Frankly, your answer still does not clear up my doubts. However, from reading other blogs, I have gained some understanding of the concept.
    In my personal opinion, it would be helpful if you could make a video explaining this (as well as other complicated terminology), as I am part of a big group that struggles to understand these concepts. Yes, making a video takes time and resources, but it would eventually save time, since you would have less explaining to do.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Thanks for the feedback, @CuriusScientist, we'll consider doing that.
