We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
How to check if the RG information is valid?

Dear gatk team,
I have a bam file (merged using markduplicates after bwa mem) with the following RG tags:
view -H file.bam | grep '^@RG'
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:Lib1 SM:Sample1 @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:Lib1 SM:Sample1 @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:Lib1 SM:Sample1
ValidateSamFile
in summary mode reports no errors.HaplotypeCaller
and other gatk tools report no errors.
However, using CollectAlignmentSummaryMetrics
with METRIC_ACCUMULATION_LEVEL=READ_GROUP
(Version:4.1.2.0) produces this:
CATEGORY … SAMPLE LIBRARY READ_GROUP FIRST_OF_PAIR … SECOND_OF_PAIR … PAIR … UNPAIRED … Sample1 Lib1 FIRST_OF_PAIR … unknown unknown unknown SECOND_OF_PAIR … unknown unknown unknown PAIR … unknown unknown unknown
However, samtools stats --split RG file.bam | grep '^SN'
gives me bamstat files for all three read groups with the proper number of reads.
file.bam_FLOWCELL1.LANE1.bamstat file.bam_FLOWCELL1.LANE2.bamstat file.bam_FLOWCELL1.LANE3.bamstat
I would have expected something similar from the CollectAlignmentSummaryMetrics output.
Did I miss something? How can I check that the read group information I've provided is valid and has been accordingly used by the gatk tools?
Could it have something to do with the fact that I did not state the PU field? Is this a problem? I thought gatk does not require it (see here).
Thank you very much for your help!
UPDATE
I re-ran CollectAlignmentSummaryMetrics
with all possible values for METRIC_ACCUMULATION_LEVEL
so : {ALL_READS, SAMPLE, LIBRARY, READ_GROUP}
, which produces this:
CATEGORY … SAMPLE LIBRARY READ_GROUP FIRST_OF_PAIR … SECOND_OF_PAIR … PAIR … FIRST_OF_PAIR … Sample1 SECOND_OF_PAIR … Sample1 PAIR … Sample1 FIRST_OF_PAIR … Sample1 Lib1 SECOND_OF_PAIR … Sample1 Lib1 PAIR … Sample1 Lib1 UNPAIRED … Sample1 Lib1 FIRST_OF_PAIR … unknown unknown unknown SECOND_OF_PAIR … unknown unknown unknown PAIR … unknown unknown unknown
So Sample and Library are recognised just not the RG...
Thx!
Answers
Dear gatk team,
I was wondering if there are any updates on this?
Thank you so much for your help!!!!
Hi @Juls
I am looking into this and will get back to you shortly.
HI @Juls
That looks like a bug to me. I have created a issue ticket for it on github. You can follow and comment on the progress of the issue here: https://github.com/broadinstitute/picard/issues/1372
Thanks for the update!
Hi @Juls
Hi @bhanuGandham
ValidateSamFile
in summary mode reports no errors.RG:Z:FLOWCELL1.LANE2
and as mentioned abovesamtools stats --split RG file.bam | grep '^SN'
gives me bamstat files for all three read groups with the proper number of reads. The tags of the reads also correspond to the ones in the header: fore e.g.RG:Z:FLOWCELL1.LANE2
- the header is@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:Lib1 SM:Sample1
Could it have something to do with the fact that I did not state the PU field? Is this a problem?
Thanks so much!
Hi @Juls
Would you please post a few reads from the input bam with the RG tags.
@Juls
You could try to add the PU fields to see if this resolves the issue. Take a look at this tool to add and replace read groups: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_AddOrReplaceReadGroups.php
@bhanuGandham
I've re-run my mappings (
bwa mem -M -R
) with updated RGs:And
CollectAlignmentSummaryMetrics
works perfectly now.So to clear things up:
MarkDuplicates
needs theLB
fieldCollectAlignmentSummaryMetrics
needs thePU
fieldBaseRecalibrator
needs thePU
orID
fieldHaplotypeCaller
needs thePU
orID
fieldMutect2
needs thePU
orID
fieldand the
SM
field is used for sample definition in the resulting vcf files.Is this correct?
If I use an alternative to
CollectAlignmentSummaryMetrics
, I do not need to re-run all my samples, as the GATK tools work just with theID
field (and noPU
field) just fine. Is this correct?Will the presence of the
PU
field lead to different results as just theID
field?I mean the
PU
is slightly differently defined as theID
field:{FLOWCELL_BARCODE}.{LANE}.{SAMPLE}, where sample is the sample or library specific identifier. So I could also add the library to the
PU
field and define my RG like this if I have multiple libs:Would this make a difference? I mean do the tools (
BaseRecalibrator
,HaplotypeCaller
,Mutect2
) then use lib and lane as covariate or would this lead to the same results essentially as the RGs just define a separate batch?Thanks!
Hi @Juls
This doc: https://software.broadinstitute.org/gatk/documentation/article?id=11015 lists all the read group fields that are required by GATK tools. And in your case looks like PU filed was causing the error.
Reads groups are essentially used in the duplicate marking and base recalibration steps.
hi @bhanuGandham
Thank you for your input!
So, it's save to assume that the analysis worked fine without the
PU
field (apart fromCollectAlignmentSummaryMetrics
). Correct? All other tools did not report an error.It would be great if
CollectAlignmentSummaryMetrics
could also work with theID
fieldIn the article you mention, they say:
The stated example is a simple one with the same lib but different lanes, where different lanes define different batches.
However, I would like to know if the gatk tools (
BaseRecalibrator
,HaplotypeCaller
,Mutect2
) would use lib AND lane as covariate if I state thePU
field as well or would this lead to the same results essentially as the RGs just define a separate batch?