Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

How read groups affect Variant Calling?

Gaurav1983Gaurav1983 Posts: 10Member
edited January 2013 in Ask the GATK team

Hi,

I have a bam file with multiple read groups for same sample. Does variant calling algorithm (UnifiedGenotyper) will consider bam file as multiple-sample data or a single sample-data (irrespective of read groups) for calling varaints?

Eg.

Read Groups in BAM file:

@RG     ID:41852        PL:illumina     PU:41852        LB:nolib        SM:41852p
@RG     ID:41852.1      PL:illumina     PU:41852        LB:nolib        SM:41852s
@RG     ID:41853        PL:illumina     PU:41853        LB:nolib        SM:41853s
@RG     ID:41854        PL:illumina     PU:41854        LB:nolib        SM:41854p
@RG     ID:41854.4      PL:illumina     PU:41854        LB:nolib        SM:41854s
@RG     ID:41855        PL:illumina     PU:41855        LB:nolib        SM:41855p
@RG     ID:41855.6      PL:illumina     PU:41855        LB:nolib        SM:41855s

Variants call in VCF file

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  41852p  41852s  41853s  41854p  41854s  41855p  41855s

chrM    150     .       T       C       41679.01        PASS    ABHom=0.999;AC=14;AF=1.00;AN=14;BaseQRankSum=1.362;DP=1255;DS;Dels=0.00;FS=0.000;HaplotypeScore=4.6324;MLEAC=14;MLEAF=1.00;MQ=40.73;MQ0=1;MQRankSum=-0.678;OND=3.149e-03;QD=33.21;ReadPosRankSum=1.479;SB=-2.052e+04    GT:AD:DP:GQ:PL  1/1:0,200:200:99:6075,544,0     1/1:0,200:200:99:6315,568,0     1/1:0,200:200:99:7094,599,0     1/1:1,197:198:99:6820,547,0     1/1:0,113:113:99:3624,322,0     1/1:0,200:200:99:7111,599,0     1/1:0,141:141:99:4640,403,0

Regards

Gaurav

Post edited by Geraldine_VdAuwera on
Tagged:

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,192Administrator, GATK Developer admin

    Hi there,

    The caller will consider reads that have the same sample name (SM) together, even if they belong to different read groups. But the sample name must be the same. In your example file it seems your sample names are all different, with "p" or "s" differentiating samples that have the same number. If you want those to be taken together, you will need to modify the sample names to lose the p or s.

    Geraldine Van der Auwera, PhD

  • rxy712rxy712 Posts: 12Member
    edited May 27

    As I read that RGID refers to the lane information, would my previous definition cause any problem?

    Lane 1: case1-normal case1-tumor1 case1-tumor2 case2-normal
    Lane 2: case2-tumor1 case2-tumor2 .....
    

    I used to define:

    RGID=case1-normal  SM=case1-normal  LB=nolib  PL=illumina
    RGID=case1-tumor1  SM=case1-tumor1  LB=nolib  PL=illumina
    .......
    RGID=case2-tumor2  SM=case2-tumor2  LB=nolib  PL=illumina

    The correct way should be like this, right?

    RGID=lane1  SM=case1-normal  LB=nolib  PL=illumina
    RGID=lane1  SM=case1-tumor1  LB=nolib  PL=illumina
    .......
    RGID=lane2  SM=case2-tumor2  LB=nolib  PL=illumina

    Could you please explain the difference? I am thinking that with the specific lane information, the GATK will treat the 4 samples from lane1 as having the same background, right? Any comment would be very appreciated! Thanks!

    Post edited by rxy712 on
  • SheilaSheila Broad InstitutePosts: 428Member, GATK Developer, Broadie, Moderator admin

    @rxy712

    Hi,

    You are correct that the correct way to label them is the second option.

    In the first option, you are labeling the read groups with different names even though they came from the same lane. You want to label read groups that came from the same lane with the same name, because that is how BQSR distinguishes errors based on lanes. BQSR requires lots of read information and will perform more accurately if it has lots of reads from the same lane so it can determine lane bias. If you label read groups from the same lane with different names, BQSR will not know they have the same bias and will be less accurate.

    Basically, bias occurs in lanes, so the more data you have for a lane, the more accurate BQSR will be in calculating the lane bias.

    -Sheila

  • rxy712rxy712 Posts: 12Member

    Thank you very much for your detailed explanation. That's very helpful. I now have one further question. I read from the SEQAnswers forum that

    If you have one sample (called x) prepped with two different libraries (called y and z), and one library is sequenced on lanes A and B and another is sequenced on lanes B and C (assuming the libraries are indexed), then you can encapsulate all of this information with RG/SM/LB:
    lane A: SM:x,LB:y,RG:Ay
    lane B: SM:x,LB:y,RG:By and SM;x,LB:z,RG:Bz
    lane C: SM:x,LB:z,RG:Cz

    I assume RG means ID. But according to the information on the GATK forum, wouldn't the RGID be A,B or C instead of Ay, By, Bz, or Cz so that BQSR can distinguish different lanes correctly?

    lane A: SM:x,LB:y,RG:A
    lane B: SM:x,LB:y,RG:B and SM;x,LB:z,RG:B
    lane C: SM:x,LB:z,RG:C
  • SheilaSheila Broad InstitutePosts: 428Member, GATK Developer, Broadie, Moderator admin

    @rxy712

    Hello,

    You are correct that the RGID would be A,B, or C.

    -Sheila

  • rxy712rxy712 Posts: 12Member

    I really appreciate your help. Thank you!

  • rxy712rxy712 Posts: 12Member

    I have some further questions: I have realigned tumor/normal samples from the same patient together, but I am not sure how to do the following BQSR and HaplotypeCaller steps. Do I do each individual sample separately or combine tumor/normal samples together in the 2 steps? If together, I do not see nWayOut option in BQSR PrintReads command. If separately, how does BQSR recognize the lane information (i.e. I have 6 samples in two lanes)? Do I need to remove duplicates again after realigning (I do not have library information though)?

    Lane 1: c1-normal c1-tumor1 c1-tumor2 c2-normal
    Lane 2: c2-tumor1 c2-tumor2

    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37_decoy.fasta -I c1-normal_rmdup.bam -I c1-tumor1_rmdup.bam -I c1-tumor2_rmdup.bam -known 1000G_phase1.indels.b37.vcf -known Mills_and_1000G_gold_standard.indels.b37.vcf -o c1.realigner.intervals

    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37_decoy.fasta -I c1-normal_rmdup.bam -I c1-tumor1_rmdup.bam -I c1-tumor2_rmdup.bam -known 1000G_phase1.indels.b37.vcf -known Mills_and_1000G_gold_standard.indels.b37.vcf -targetIntervals c1.realigner.intervals --nWayOut .realigned.bam

  • SheilaSheila Broad InstitutePosts: 428Member, GATK Developer, Broadie, Moderator admin

    @rxy712

    Hi,

    BaseRecalibrator uses read group IDs to identify lanes, so you can run BQSR on separate lanes. There is no need to mark duplicates again after realignment.

    I assume instead of HaplotypeCaller you will use MuTect because you are working with cancer samples?

    You can check out this FAQ for some more information: http://gatkforums.broadinstitute.org/discussion/3060/which-data-processing-steps-should-i-do-per-lane-vs-per-sample#latest

    -Sheila

  • rxy712rxy712 Posts: 12Member

    Thank you for your comments! I am still a little unclear. (1) If BQSR should be run on the lane level, and I do the 4 samples from lane1 together in one BQSR command, and do the 2 samples from lane2 in another BQSR command, how can I separate the bam files back? I do not see an nWayOut option in the PrintReads step. In that case, c1 and c2 samples are mixed up in the output bam file. (2) if I run one BQSR command on each of the 6 samples (totally 6 BQSR commands), how does BQSR recognize 4 of the 6 are from lane1, and 2 of the 6 are from lane2, as there is only 1 sample in each BQSR command?

    Lane 1: c1-normal c1-tumor1 c1-tumor2 c2-normal
    Lane 2: c2-tumor1 c2-tumor2

    Sorry if I made it too complicated.

  • SheilaSheila Broad InstitutePosts: 428Member, GATK Developer, Broadie, Moderator admin

    @rxy712

    Hi,

    The best solution is to run BaseRecalibrator on the multiple bams, but then apply the recalibration (the PrintReads step) per-sample. The reason this works is that only the model-building step (BaseRecalibrator) has to be run on all the lane bams. Once you have the recalibration table, you can apply the recalibration individually to any bam (as long as they were included in the original model).

    -Sheila

  • SheilaSheila Broad InstitutePosts: 428Member, GATK Developer, Broadie, Moderator admin
  • rxy712rxy712 Posts: 12Member

    That is very clear. Thank you so much for your constant support/help!

Sign In or Register to comment.