Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

How to check if the RG information is valid?

JulsJuls Member ✭✭
edited July 18 in Ask the GATK team

Dear gatk team,

I have a bam file (merged using markduplicates after bwa mem) with the following RG tags:

view -H file.bam | grep '^@RG'

@RG ID:FLOWCELL1.LANE1  PL:ILLUMINA LB:Lib1 SM:Sample1
@RG ID:FLOWCELL1.LANE2  PL:ILLUMINA LB:Lib1 SM:Sample1
@RG ID:FLOWCELL1.LANE3  PL:ILLUMINA LB:Lib1 SM:Sample1

ValidateSamFile in summary mode reports no errors.
HaplotypeCaller and other gatk tools report no errors.

However, using CollectAlignmentSummaryMetrics with METRIC_ACCUMULATION_LEVEL=READ_GROUP (Version:4.1.2.0) produces this:

CATEGORY            …   SAMPLE  LIBRARY READ_GROUP
FIRST_OF_PAIR       …           
SECOND_OF_PAIR      …           
PAIR                …           
UNPAIRED            …   Sample1 Lib1    
FIRST_OF_PAIR       …   unknown unknown unknown
SECOND_OF_PAIR      …   unknown unknown unknown
PAIR                …   unknown unknown unknown

However, samtools stats --split RG file.bam | grep '^SN' gives me bamstat files for all three read groups with the proper number of reads.

file.bam_FLOWCELL1.LANE1.bamstat
file.bam_FLOWCELL1.LANE2.bamstat
file.bam_FLOWCELL1.LANE3.bamstat

I would have expected something similar from the CollectAlignmentSummaryMetrics output.

Did I miss something? How can I check that the read group information I've provided is valid and has been accordingly used by the gatk tools?
Could it have something to do with the fact that I did not state the PU field? Is this a problem? I thought gatk does not require it (see here).
Thank you very much for your help!

UPDATE

I re-ran CollectAlignmentSummaryMetrics with all possible values for METRIC_ACCUMULATION_LEVEL so : {ALL_READS, SAMPLE, LIBRARY, READ_GROUP}, which produces this:

CATEGORY            …   SAMPLE  LIBRARY READ_GROUP
FIRST_OF_PAIR       …           
SECOND_OF_PAIR      …           
PAIR                …           
FIRST_OF_PAIR       …   Sample1     
SECOND_OF_PAIR      …   Sample1     
PAIR                …   Sample1     
FIRST_OF_PAIR       …   Sample1 Lib1    
SECOND_OF_PAIR      …   Sample1 Lib1    
PAIR                …   Sample1 Lib1    
UNPAIRED            …   Sample1 Lib1    
FIRST_OF_PAIR       …   unknown unknown unknown
SECOND_OF_PAIR      …   unknown unknown unknown
PAIR                …   unknown unknown unknown

So Sample and Library are recognised just not the RG...

Thx!

Post edited by Juls on

Issue · Github
by bhanuGandham

Issue Number
1372
State
closed
Last Updated
Closed By
bhanugandham

Answers

  • JulsJuls Member ✭✭

    Dear gatk team,
    I was wondering if there are any updates on this?
    Thank you so much for your help!!!!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Juls

    I am looking into this and will get back to you shortly.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @Juls

    That looks like a bug to me. I have created a issue ticket for it on github. You can follow and comment on the progress of the issue here: https://github.com/broadinstitute/picard/issues/1372

  • JulsJuls Member ✭✭

    Thanks for the update!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Juls

    1. does the file validate by picard?
    2. do the reads (not the header) have a RG tag?
  • JulsJuls Member ✭✭
    edited August 8

    Hi @bhanuGandham

    1. yes ValidateSamFile in summary mode reports no errors.
    2. yes the reads have RG tags - these look for example like this: RG:Z:FLOWCELL1.LANE2 and as mentioned above samtools stats --split RG file.bam | grep '^SN' gives me bamstat files for all three read groups with the proper number of reads. The tags of the reads also correspond to the ones in the header: fore e.g. RG:Z:FLOWCELL1.LANE2 - the header is @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:Lib1 SM:Sample1
      Could it have something to do with the fact that I did not state the PU field? Is this a problem?

    Thanks so much!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Juls

    Would you please post a few reads from the input bam with the RG tags.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @Juls

    You could try to add the PU fields to see if this resolves the issue. Take a look at this tool to add and replace read groups: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_AddOrReplaceReadGroups.php

  • JulsJuls Member ✭✭
    edited August 14

    @bhanuGandham

    I've re-run my mappings (bwa mem -M -R) with updated RGs:

    @RG ID:FLOWCELL1.LANE1  PU:FLOWCELL1.LANE1.Sample1 PL:ILLUMINA LB:Lib1 SM:Sample1
    @RG ID:FLOWCELL1.LANE2  PU:FLOWCELL1.LANE1.Sample1 PL:ILLUMINA LB:Lib1 SM:Sample1
    @RG ID:FLOWCELL1.LANE3  PU:FLOWCELL1.LANE1.Sample1 PL:ILLUMINA LB:Lib1 SM:Sample1
    

    And CollectAlignmentSummaryMetrics works perfectly now.
    So to clear things up:
    MarkDuplicates needs the LB field
    CollectAlignmentSummaryMetrics needs the PU field
    BaseRecalibrator needs the PU or ID field
    HaplotypeCaller needs the PU or ID field
    Mutect2 needs the PU or ID field
    and the SM field is used for sample definition in the resulting vcf files.
    Is this correct?
    If I use an alternative to CollectAlignmentSummaryMetrics, I do not need to re-run all my samples, as the GATK tools work just with the ID field (and no PU field) just fine. Is this correct?

    Will the presence of the PU field lead to different results as just the ID field?
    I mean the PU is slightly differently defined as the ID field:
    {FLOWCELL_BARCODE}.{LANE}.{SAMPLE}, where sample is the sample or library specific identifier. So I could also add the library to the PU field and define my RG like this if I have multiple libs:

    @RG ID:FLOWCELL1.LANE1  PU:FLOWCELL1.LANE1.Sample2Lib1 PL:ILLUMINA LB:Lib1 SM:Sample2
    @RG ID:FLOWCELL1.LANE2  PU:FLOWCELL1.LANE2.Sample2Lib2 PL:ILLUMINA LB:Lib2 SM:Sample2
    @RG ID:FLOWCELL1.LANE3  PU:FLOWCELL1.LANE3.Sample2Lib2 PL:ILLUMINA LB:Lib2 SM:Sample2
    

    Would this make a difference? I mean do the tools (BaseRecalibrator, HaplotypeCaller, Mutect2) then use lib and lane as covariate or would this lead to the same results essentially as the RGs just define a separate batch?

    Thanks!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @Juls

    This doc: https://software.broadinstitute.org/gatk/documentation/article?id=11015 lists all the read group fields that are required by GATK tools. And in your case looks like PU filed was causing the error.
    Reads groups are essentially used in the duplicate marking and base recalibration steps.

  • JulsJuls Member ✭✭
    edited August 16

    hi @bhanuGandham

    Thank you for your input!

    So, it's save to assume that the analysis worked fine without the PU field (apart from CollectAlignmentSummaryMetrics). Correct? All other tools did not report an error.
    It would be great if CollectAlignmentSummaryMetrics could also work with the ID field :)

    In the article you mention, they say:

    In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

    The stated example is a simple one with the same lib but different lanes, where different lanes define different batches.
    However, I would like to know if the gatk tools (BaseRecalibrator, HaplotypeCaller, Mutect2) would use lib AND lane as covariate if I state the PU field as well or would this lead to the same results essentially as the RGs just define a separate batch?

Sign In or Register to comment.