The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Got a problem?


1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?


Then follow instructions in Article#1894.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

How read groups affect Variant Calling?

Gaurav1983Gaurav1983 Member
edited January 2013 in Ask the GATK team

Hi,

I have a bam file with multiple read groups for same sample. Does variant calling algorithm (UnifiedGenotyper) will consider bam file as multiple-sample data or a single sample-data (irrespective of read groups) for calling varaints?

Eg.

Read Groups in BAM file:

@RG     ID:41852        PL:illumina     PU:41852        LB:nolib        SM:41852p
@RG     ID:41852.1      PL:illumina     PU:41852        LB:nolib        SM:41852s
@RG     ID:41853        PL:illumina     PU:41853        LB:nolib        SM:41853s
@RG     ID:41854        PL:illumina     PU:41854        LB:nolib        SM:41854p
@RG     ID:41854.4      PL:illumina     PU:41854        LB:nolib        SM:41854s
@RG     ID:41855        PL:illumina     PU:41855        LB:nolib        SM:41855p
@RG     ID:41855.6      PL:illumina     PU:41855        LB:nolib        SM:41855s

Variants call in VCF file

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  41852p  41852s  41853s  41854p  41854s  41855p  41855s

chrM    150     .       T       C       41679.01        PASS    ABHom=0.999;AC=14;AF=1.00;AN=14;BaseQRankSum=1.362;DP=1255;DS;Dels=0.00;FS=0.000;HaplotypeScore=4.6324;MLEAC=14;MLEAF=1.00;MQ=40.73;MQ0=1;MQRankSum=-0.678;OND=3.149e-03;QD=33.21;ReadPosRankSum=1.479;SB=-2.052e+04    GT:AD:DP:GQ:PL  1/1:0,200:200:99:6075,544,0     1/1:0,200:200:99:6315,568,0     1/1:0,200:200:99:7094,599,0     1/1:1,197:198:99:6820,547,0     1/1:0,113:113:99:3624,322,0     1/1:0,200:200:99:7111,599,0     1/1:0,141:141:99:4640,403,0

Regards

Gaurav

Post edited by Geraldine_VdAuwera on
Tagged:

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi there,

    The caller will consider reads that have the same sample name (SM) together, even if they belong to different read groups. But the sample name must be the same. In your example file it seems your sample names are all different, with "p" or "s" differentiating samples that have the same number. If you want those to be taken together, you will need to modify the sample names to lose the p or s.

  • rxy712rxy712 Member
    edited May 2014

    As I read that RGID refers to the lane information, would my previous definition cause any problem?

    Lane 1: case1-normal case1-tumor1 case1-tumor2 case2-normal
    Lane 2: case2-tumor1 case2-tumor2 .....
    

    I used to define:
    RGID=case1-normal SM=case1-normal LB=nolib PL=illumina RGID=case1-tumor1 SM=case1-tumor1 LB=nolib PL=illumina ....... RGID=case2-tumor2 SM=case2-tumor2 LB=nolib PL=illumina

    The correct way should be like this, right?
    RGID=lane1 SM=case1-normal LB=nolib PL=illumina RGID=lane1 SM=case1-tumor1 LB=nolib PL=illumina ....... RGID=lane2 SM=case2-tumor2 LB=nolib PL=illumina

    Could you please explain the difference? I am thinking that with the specific lane information, the GATK will treat the 4 samples from lane1 as having the same background, right?
    Any comment would be very appreciated! Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rxy712‌

    Hi,

    You are correct that the correct way to label them is the second option.

    In the first option, you are labeling the read groups with different names even though they came from the same lane. You want to label read groups that came from the same lane with the same name, because that is how BQSR distinguishes errors based on lanes. BQSR requires lots of read information and will perform more accurately if it has lots of reads from the same lane so it can determine lane bias. If you label read groups from the same lane with different names, BQSR will not know they have the same bias and will be less accurate.

    Basically, bias occurs in lanes, so the more data you have for a lane, the more accurate BQSR will be in calculating the lane bias.

    -Sheila

  • rxy712rxy712 Member

    Thank you very much for your detailed explanation. That's very helpful. I now have one further question. I read from the SEQAnswers forum that
    If you have one sample (called x) prepped with two different libraries (called y and z), and one library is sequenced on lanes A and B and another is sequenced on lanes B and C (assuming the libraries are indexed), then you can encapsulate all of this information with RG/SM/LB: lane A: SM:x,LB:y,RG:Ay lane B: SM:x,LB:y,RG:By and SM;x,LB:z,RG:Bz lane C: SM:x,LB:z,RG:Cz

    I assume RG means ID. But according to the information on the GATK forum, wouldn't the RGID be A,B or C instead of Ay, By, Bz, or Cz so that BQSR can distinguish different lanes correctly?

    lane A: SM:x,LB:y,RG:A lane B: SM:x,LB:y,RG:B and SM;x,LB:z,RG:B lane C: SM:x,LB:z,RG:C

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rxy712‌

    Hello,

    You are correct that the RGID would be A,B, or C.

    -Sheila

  • rxy712rxy712 Member

    I really appreciate your help. Thank you!

  • rxy712rxy712 Member

    I have some further questions:
    I have realigned tumor/normal samples from the same patient together, but I am not sure how to do the following BQSR and HaplotypeCaller steps. Do I do each individual sample separately or combine tumor/normal samples together in the 2 steps? If together, I do not see nWayOut option in BQSR PrintReads command. If separately, how does BQSR recognize the lane information (i.e. I have 6 samples in two lanes)?
    Do I need to remove duplicates again after realigning (I do not have library information though)?

    Lane 1: c1-normal c1-tumor1 c1-tumor2 c2-normal Lane 2: c2-tumor1 c2-tumor2

    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37_decoy.fasta -I c1-normal_rmdup.bam -I c1-tumor1_rmdup.bam -I c1-tumor2_rmdup.bam -known 1000G_phase1.indels.b37.vcf -known Mills_and_1000G_gold_standard.indels.b37.vcf -o c1.realigner.intervals

    java -jar GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37_decoy.fasta -I c1-normal_rmdup.bam -I c1-tumor1_rmdup.bam -I c1-tumor2_rmdup.bam -known 1000G_phase1.indels.b37.vcf -known Mills_and_1000G_gold_standard.indels.b37.vcf -targetIntervals c1.realigner.intervals --nWayOut .realigned.bam

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rxy712‌

    Hi,

    BaseRecalibrator uses read group IDs to identify lanes, so you can run BQSR on separate lanes. There is no need to mark duplicates again after realignment.

    I assume instead of HaplotypeCaller you will use MuTect because you are working with cancer samples?

    You can check out this FAQ for some more information: http://gatkforums.broadinstitute.org/discussion/3060/which-data-processing-steps-should-i-do-per-lane-vs-per-sample#latest

    -Sheila

  • rxy712rxy712 Member

    Thank you for your comments! I am still a little unclear. (1) If BQSR should be run on the lane level, and I do the 4 samples from lane1 together in one BQSR command, and do the 2 samples from lane2 in another BQSR command, how can I separate the bam files back? I do not see an nWayOut option in the PrintReads step. In that case, c1 and c2 samples are mixed up in the output bam file. (2) if I run one BQSR command on each of the 6 samples (totally 6 BQSR commands), how does BQSR recognize 4 of the 6 are from lane1, and 2 of the 6 are from lane2, as there is only 1 sample in each BQSR command?

    Lane 1: c1-normal c1-tumor1 c1-tumor2 c2-normal Lane 2: c2-tumor1 c2-tumor2

    Sorry if I made it too complicated.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rxy712‌

    Hi,

    The best solution is to run BaseRecalibrator on the multiple bams, but then apply the recalibration (the PrintReads step) per-sample. The reason this works is that only the model-building step (BaseRecalibrator) has to be run on all the lane bams. Once you have the recalibration table, you can apply the recalibration individually to any bam (as long as they were included in the original model).

    -Sheila

  • rxy712rxy712 Member

    That is very clear. Thank you so much for your constant support/help!

Sign In or Register to comment.