The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# How read groups affect Variant Calling?

Posts: 12Member
edited January 2013

Hi,

I have a bam file with multiple read groups for same sample. Does variant calling algorithm (UnifiedGenotyper) will consider bam file as multiple-sample data or a single sample-data (irrespective of read groups) for calling varaints?

Eg.

@RG     ID:41852        PL:illumina     PU:41852        LB:nolib        SM:41852p
@RG     ID:41852.1      PL:illumina     PU:41852        LB:nolib        SM:41852s
@RG     ID:41853        PL:illumina     PU:41853        LB:nolib        SM:41853s
@RG     ID:41854        PL:illumina     PU:41854        LB:nolib        SM:41854p
@RG     ID:41854.4      PL:illumina     PU:41854        LB:nolib        SM:41854s
@RG     ID:41855        PL:illumina     PU:41855        LB:nolib        SM:41855p
@RG     ID:41855.6      PL:illumina     PU:41855        LB:nolib        SM:41855s


Variants call in VCF file

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  41852p  41852s  41853s  41854p  41854s  41855p  41855s

chrM    150     .       T       C       41679.01        PASS    ABHom=0.999;AC=14;AF=1.00;AN=14;BaseQRankSum=1.362;DP=1255;DS;Dels=0.00;FS=0.000;HaplotypeScore=4.6324;MLEAC=14;MLEAF=1.00;MQ=40.73;MQ0=1;MQRankSum=-0.678;OND=3.149e-03;QD=33.21;ReadPosRankSum=1.479;SB=-2.052e+04    GT:AD:DP:GQ:PL  1/1:0,200:200:99:6075,544,0     1/1:0,200:200:99:6315,568,0     1/1:0,200:200:99:7094,599,0     1/1:1,197:198:99:6820,547,0     1/1:0,113:113:99:3624,322,0     1/1:0,200:200:99:7111,599,0     1/1:0,141:141:99:4640,403,0


Regards

Gaurav

Post edited by Geraldine_VdAuwera on
Tagged:

Hi there,

The caller will consider reads that have the same sample name (SM) together, even if they belong to different read groups. But the sample name must be the same. In your example file it seems your sample names are all different, with "p" or "s" differentiating samples that have the same number. If you want those to be taken together, you will need to modify the sample names to lose the p or s.

Geraldine Van der Auwera, PhD

• Posts: 18Member
edited May 2014

As I read that RGID refers to the lane information, would my previous definition cause any problem?

Lane 1: case1-normal case1-tumor1 case1-tumor2 case2-normal
Lane 2: case2-tumor1 case2-tumor2 .....


I used to define:

RGID=case1-normal  SM=case1-normal  LB=nolib  PL=illumina
RGID=case1-tumor1  SM=case1-tumor1  LB=nolib  PL=illumina
.......
RGID=case2-tumor2  SM=case2-tumor2  LB=nolib  PL=illumina

The correct way should be like this, right?

RGID=lane1  SM=case1-normal  LB=nolib  PL=illumina
RGID=lane1  SM=case1-tumor1  LB=nolib  PL=illumina
.......
RGID=lane2  SM=case2-tumor2  LB=nolib  PL=illumina

Could you please explain the difference? I am thinking that with the specific lane information, the GATK will treat the 4 samples from lane1 as having the same background, right?
Any comment would be very appreciated! Thanks!

@rxy712‌

Hi,

You are correct that the correct way to label them is the second option.

In the first option, you are labeling the read groups with different names even though they came from the same lane. You want to label read groups that came from the same lane with the same name, because that is how BQSR distinguishes errors based on lanes. BQSR requires lots of read information and will perform more accurately if it has lots of reads from the same lane so it can determine lane bias. If you label read groups from the same lane with different names, BQSR will not know they have the same bias and will be less accurate.

Basically, bias occurs in lanes, so the more data you have for a lane, the more accurate BQSR will be in calculating the lane bias.

-Sheila

• Posts: 18Member

Thank you very much for your detailed explanation. That's very helpful. I now have one further question. I read from the SEQAnswers forum that

If you have one sample (called x) prepped with two different libraries (called y and z), and one library is sequenced on lanes A and B and another is sequenced on lanes B and C (assuming the libraries are indexed), then you can encapsulate all of this information with RG/SM/LB:
lane A: SM:x,LB:y,RG:Ay
lane B: SM:x,LB:y,RG:By and SM;x,LB:z,RG:Bz
lane C: SM:x,LB:z,RG:Cz

I assume RG means ID. But according to the information on the GATK forum, wouldn't the RGID be A,B or C instead of Ay, By, Bz, or Cz so that BQSR can distinguish different lanes correctly?

lane A: SM:x,LB:y,RG:A
lane B: SM:x,LB:y,RG:B and SM;x,LB:z,RG:B
lane C: SM:x,LB:z,RG:C

@rxy712‌

Hello,

You are correct that the RGID would be A,B, or C.

-Sheila

• Posts: 18Member

I really appreciate your help. Thank you!

• Posts: 18Member

I have some further questions:
I have realigned tumor/normal samples from the same patient together, but I am not sure how to do the following BQSR and HaplotypeCaller steps. Do I do each individual sample separately or combine tumor/normal samples together in the 2 steps? If together, I do not see nWayOut option in BQSR PrintReads command. If separately, how does BQSR recognize the lane information (i.e. I have 6 samples in two lanes)?
Do I need to remove duplicates again after realigning (I do not have library information though)?

Lane 1: c1-normal c1-tumor1 c1-tumor2 c2-normal
Lane 2: c2-tumor1 c2-tumor2

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37_decoy.fasta -I c1-normal_rmdup.bam -I c1-tumor1_rmdup.bam -I c1-tumor2_rmdup.bam -known 1000G_phase1.indels.b37.vcf -known Mills_and_1000G_gold_standard.indels.b37.vcf -o c1.realigner.intervals

java -jar GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37_decoy.fasta -I c1-normal_rmdup.bam -I c1-tumor1_rmdup.bam -I c1-tumor2_rmdup.bam -known 1000G_phase1.indels.b37.vcf -known Mills_and_1000G_gold_standard.indels.b37.vcf -targetIntervals c1.realigner.intervals --nWayOut .realigned.bam

@rxy712‌

Hi,

BaseRecalibrator uses read group IDs to identify lanes, so you can run BQSR on separate lanes. There is no need to mark duplicates again after realignment.

I assume instead of HaplotypeCaller you will use MuTect because you are working with cancer samples?

-Sheila

• Posts: 18Member

Thank you for your comments! I am still a little unclear. (1) If BQSR should be run on the lane level, and I do the 4 samples from lane1 together in one BQSR command, and do the 2 samples from lane2 in another BQSR command, how can I separate the bam files back? I do not see an nWayOut option in the PrintReads step. In that case, c1 and c2 samples are mixed up in the output bam file. (2) if I run one BQSR command on each of the 6 samples (totally 6 BQSR commands), how does BQSR recognize 4 of the 6 are from lane1, and 2 of the 6 are from lane2, as there is only 1 sample in each BQSR command?

Lane 1: c1-normal c1-tumor1 c1-tumor2 c2-normal
Lane 2: c2-tumor1 c2-tumor2

Sorry if I made it too complicated.

@rxy712‌

Hi,

The best solution is to run BaseRecalibrator on the multiple bams, but then apply the recalibration (the PrintReads step) per-sample. The reason this works is that only the model-building step (BaseRecalibrator) has to be run on all the lane bams. Once you have the recalibration table, you can apply the recalibration individually to any bam (as long as they were included in the original model).

-Sheila