Understanding Read Groups
I am new to GATK and very confused about "Read Groups". I have fastq files and I would like to add read groups to them. Below is what I have understood so far; please let me know where I am right or wrong.
For the sake of simplicity, let's say I have 2 samples, S1 and S2, each sequenced on 3 lanes. This gives 6 fastq files: S1_L1.fastq.gz, S1_L2.fastq.gz, S1_L3.fastq.gz, S2_L1.fastq.gz, S2_L2.fastq.gz and S2_L3.fastq.gz
The SAM specification (http://samtools.github.io/hts-specs/SAMv1.pdf) says "Each @RG line must have a unique ID", and the GATK documentation (https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups) says something similar:
"ID = Read group identifier This tag identifies which read group each read belongs to, so each read group's ID must be unique."
Hence, the read groups should be something like this:
RGID=S1_L1 PL=illumina SM=S1, RGID=S1_L2 PL=illumina SM=S1, RGID=S1_L3 PL=illumina SM=S1
RGID=S2_L1 PL=illumina SM=S2, RGID=S2_L2 PL=illumina SM=S2, RGID=S2_L3 PL=illumina SM=S2
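To make the scheme above concrete, here is a minimal Python sketch of how I would derive one read group per fastq file. It assumes the file names follow the `<sample>_<lane>.fastq.gz` pattern from my example; the helper names `read_group_for` and `rg_header_line` are my own and not part of GATK, Picard or samtools. It only builds the ID, PL and SM fields I listed, so other tags such as LB would still have to be added.

```python
import re

# Assumed naming scheme from the example above: <sample>_<lane>.fastq.gz
FASTQ_NAME = re.compile(r"^(?P<sample>[^_]+)_(?P<lane>L\d+)\.fastq\.gz$")


def read_group_for(fastq_name, platform="illumina"):
    """Build the read group fields for one per-lane fastq file (hypothetical helper)."""
    match = FASTQ_NAME.match(fastq_name)
    if match is None:
        raise ValueError(f"unexpected fastq name: {fastq_name}")
    sample = match.group("sample")
    lane = match.group("lane")
    # ID combines sample and lane so that it is unique per file;
    # SM is the sample alone so that all lanes of a sample share it.
    return {"ID": f"{sample}_{lane}", "PL": platform, "SM": sample}


def rg_header_line(read_group):
    """Format the fields as a tab-separated @RG header line."""
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in read_group.items())


fastqs = [
    "S1_L1.fastq.gz", "S1_L2.fastq.gz", "S1_L3.fastq.gz",
    "S2_L1.fastq.gz", "S2_L2.fastq.gz", "S2_L3.fastq.gz",
]
read_groups = [read_group_for(name) for name in fastqs]

# Every @RG line must have a unique ID, so check that before using them.
ids = [rg["ID"] for rg in read_groups]
if len(ids) != len(set(ids)):
    raise ValueError("duplicate read group IDs")

for rg in read_groups:
    print(rg_header_line(rg))
```

With this naming, the six files give six distinct IDs but only two distinct SM values.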
This should give something similar to the example GATK provides for multiplexed data:
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
In my understanding, the idea behind using the same SM name is to group data by sample, irrespective of the ID, which can differ because of lanes etc.
Now, I thought "ID" was used to check for technical differences between lanes, for example to check whether one lane out of four was faulty and had more PCR duplicates than the rest. How is this possible when every ID is different?
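Here is a small Python sketch of how I picture that grouping, using the GATK example header above. The parsing helpers `parse_rg` and `ids_by_tag` are my own illustration, not a GATK or samtools API, and they split on any whitespace because the header is shown space-separated here.

```python
from collections import defaultdict

# The @RG lines from the GATK multiplexed example above.
HEADER = """\
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
"""


def parse_rg(line):
    """Turn one @RG line into a {tag: value} dict (hypothetical helper)."""
    fields = line.split()
    if not fields or fields[0] != "@RG":
        raise ValueError(f"not an @RG line: {line!r}")
    return dict(field.split(":", 1) for field in fields[1:])


def ids_by_tag(header, tag):
    """Collect the read group IDs that share the same value of `tag`."""
    groups = defaultdict(list)
    for line in header.splitlines():
        if line.startswith("@RG"):
            rg = parse_rg(line)
            groups[rg[tag]].append(rg["ID"])
    return dict(groups)


by_sample = ids_by_tag(HEADER, "SM")
by_library = ids_by_tag(HEADER, "LB")

for sample, ids in by_sample.items():
    print(sample, ids)
for library, ids in by_library.items():
    print(library, ids)
```

Each ID appears exactly once, while SM collects four read groups per sample and LB collects two per library.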
This gets even more confusing when I read the following discussion: https://gatkforums.broadinstitute.org/gatk/discussion/2078/how-read-groups-affect-variant-calling
In that thread, the following was given as the right approach:
RGID=lane1 SM=case1-normal LB=nolib PL=illumina
RGID=lane1 SM=case1-tumor1 LB=nolib PL=illumina
.......
RGID=lane2 SM=case2-tumor2 LB=nolib PL=illumina
As can be seen, RGID takes only two values there, lane1 and lane2, so the same ID is used for more than one sample.
Could you kindly let me know which approach is correct, and why?
What is RGID actually used for? Knowing that would help me make an informed decision.
Also, will this stay the same in GATK4?
Waiting for reply