Read group PU field used by BaseRecalibrator
Hi, I was wondering how GATK handles BAM read group info, as the behaviour I have observed is not what I expected from reading the documentation.
I am running a comparison of different Illumina platforms using the same library, so I have given my BAM files RG info that I think reflects this:
@RG ID:2_1 PL:ILLUMINA PU:Platform1_FCA LB:Sample2 DT:2013-09-01T01:00:00+0100 SM:Sample2_Platform1
@RG ID:2_2 PL:ILLUMINA PU:Platform2_FCB LB:Sample2 DT:2013-07-17T01:00:00+0100 SM:Sample2_Platform2
@RG ID:2_3 PL:ILLUMINA PU:Platform3_FCC LB:Sample2 DT:2014-05-16T01:00:00+0100 SM:Sample2_Platform3
(IDs changed to something more generic)
I have run BaseRecalibrator, and the read group results are broken down by the PU field:
ReadGroup      EventType  EmpiricalQuality  EstimatedQReported  Observations   Errors
Platform1_FCA  M          28.2433           28.1998             6283676400.00  9416437.03
Platform1_FCA  I          41.2350           45.0000             6283676400.00  472843.68
Platform1_FCA  D          40.2707           45.0000             6283676400.00  590401.90
Platform2_FCB  M          28.7442           30.5860             258157515.00   344716.76
Platform2_FCB  I          44.4728           45.0000             258157515.00   9216.35
Platform2_FCB  D          41.1310           45.0000             258157515.00   19896.01
Platform3_FCC  M          22.9983           23.2025             2510670817.00  12588158.43
Platform3_FCC  I          44.0934           45.0000             2510670817.00  97824.77
Platform3_FCC  D          41.1932           45.0000             2510670817.00  190753.01
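As a sanity check on how to read this table, the EmpiricalQuality column appears consistent with the usual Phred scaling of the observed error rate, Q = -10 * log10(Errors / Observations). The sketch below recomputes it for a few of the rows above. The function name `phred` is my own, and I am assuming no smoothing or prior is applied; GATK's actual calculation may differ slightly.

```python
import math

def phred(errors, observations):
    """Phred-scaled quality for an observed error rate (assumed formula, no smoothing)."""
    return -10.0 * math.log10(errors / observations)

# (ReadGroup, EventType, Observations, Errors, reported EmpiricalQuality), copied from the table
rows = [
    ("Platform1_FCA", "M", 6283676400.00, 9416437.03, 28.2433),
    ("Platform2_FCB", "I", 258157515.00, 9216.35, 44.4728),
    ("Platform3_FCC", "M", 2510670817.00, 12588158.43, 22.9983),
]

for read_group, event, observations, errors, reported in rows:
    recomputed = phred(errors, observations)
    print(f"{read_group} {event}: recomputed {recomputed:.4f}, reported {reported:.4f}")
```

The recomputed values agree with the reported ones to within rounding, so the per-PU rows look like plain aggregates of errors over observations for each label.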
This is not what I would expect, given that the GATK FAQ states: "We do not require value for the CN, DS, DT, PG, PI, or PU fields."
I would expect the RG ID field to be used, since it is by definition the unique identifier for the read group.
Failing that, I would expect the SM field to be used, as the GATK documentation states that it is used and it is unique in my BAM file.
I'm guessing GATK would use the LB field preferentially, but as this is not unique in my case, it uses something else that is.
I can find no errors or warnings about this in the output or stdout/stderr.
I'm wondering what would happen when I add a second sample, with the same PU IDs?
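To see in advance which RG tags would collide once a second sample is added, the header lines can be checked for values shared between read groups. This is only a rough sketch: the helper names `parse_rg` and `duplicated_values` are made up for illustration, the header lines are typed in by hand (in practice they could come from `samtools view -H`), and it says nothing about which tag GATK actually keys on.

```python
from collections import defaultdict

# @RG lines as they appear in a SAM header (tab-separated), copied from above
RG_LINES = [
    "@RG\tID:2_1\tPL:ILLUMINA\tPU:Platform1_FCA\tLB:Sample2\tDT:2013-09-01T01:00:00+0100\tSM:Sample2_Platform1",
    "@RG\tID:2_2\tPL:ILLUMINA\tPU:Platform2_FCB\tLB:Sample2\tDT:2013-07-17T01:00:00+0100\tSM:Sample2_Platform2",
    "@RG\tID:2_3\tPL:ILLUMINA\tPU:Platform3_FCC\tLB:Sample2\tDT:2014-05-16T01:00:00+0100\tSM:Sample2_Platform3",
]

def parse_rg(line):
    """Turn one @RG header line into a {tag: value} dict."""
    fields = line.rstrip("\n").split("\t")[1:]  # drop the leading "@RG"
    # split on the first colon only, because DT values contain colons
    return dict(field.split(":", 1) for field in fields)

def duplicated_values(read_groups, tag):
    """Map each value of `tag` shared by more than one read group to the IDs using it."""
    ids_by_value = defaultdict(list)
    for rg in read_groups:
        if tag in rg:
            ids_by_value[rg[tag]].append(rg["ID"])
    return {value: ids for value, ids in ids_by_value.items() if len(ids) > 1}

read_groups = [parse_rg(line) for line in RG_LINES]
for tag in ("ID", "PU", "LB", "SM"):
    print(tag, duplicated_values(read_groups, tag))
```

With the current header only LB is shared across read groups. Appending the second sample's @RG lines to `RG_LINES` would show whether its PU values clash with the existing ones.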
Can anyone clarify how GATK BaseRecalibrator (and other tools) handle read group fields, especially where LB is the same for multiple read groups?
I need to know whether I am using the RG fields correctly for my experiment, so that the end results of the GATK pipeline best reflect what I am trying to analyse.