Read group PU field used by BaseRecalibrator

annatannat Member
edited September 2014 in Ask the GATK team

Hi, I was wondering about how GATK handles bam read group info, as behaviour I have observed is not as expected from reading the documentation.
I am running a comparison of different illumina platforms, using the same library, so have given my bam files RG info that I think reflects this:

@RG ID:2_1  PL:ILLUMINA PU:Platform1_FCA    LB:Sample2  DT:2013-09-01T01:00:00+0100 SM:Sample2_Platform1
@RG ID:2_2  PL:ILLUMINA PU:Platform2_FCB    LB:Sample2  DT:2013-07-17T01:00:00+0100 SM:Sample2_Platform2
@RG ID:2_3  PL:ILLUMINA PU:Platform3_FCC    LB:Sample2  DT:2014-05-16T01:00:00+0100 SM:Sample2_Platform3

(IDs changed to something more generic)

I have run BaseRecalibrator, and the read group results are broken down by the PU field:

ReadGroup                    EventType  EmpiricalQuality  EstimatedQReported  Observations   Errors     
Platform1_FCA    M                   28.2433             28.1998  6283676400.00   9416437.03
Platform1_FCA    I                   41.2350             45.0000  6283676400.00    472843.68
Platform1_FCA    D                   40.2707             45.0000  6283676400.00    590401.90
Platform2_FCB  M                   28.7442             30.5860   258157515.00    344716.76
Platform2_FCB  I                   44.4728             45.0000   258157515.00      9216.35
Platform2_FCB  D                   41.1310             45.0000   258157515.00     19896.01
Platform3_FCC     M                   22.9983             23.2025  2510670817.00  12588158.43
Platform3_FCC     I                   44.0934             45.0000  2510670817.00     97824.77
Platform3_FCC     D                   41.1932             45.0000  2510670817.00    190753.01

This is not what I would expect given that the FAQ for GATK states We do not require value for the CN, DS, DT, PG, PI, or PU fields.
I would expect that the RG ID field is used, that is the unique identifier for the Read group, and is by definition unique.
Failing that I would expect the SM field to be used as this is stated as being used in the GATK documentation and is unique in my bam file.
I'm guessing GATK would use the LB field preferentially but as this is not unique in my case it uses something else that is.
I can find no errors or warnings about this in the output or stdout/stderr.

I'm wondering what would happen when I add a second sample, with the same PU IDs?

Can anyone clarify how GATK BaseRecalibrator (and other tools) handle read group fields, especially where LB is the same for multiple read groups?
I need to know if I am using the RG fields correctly for my experiment, and also so that the end results of the GATK pipeline best reflect what I am trying to analyse.

Many thanks


Sign In or Register to comment.