
Read group ID and PU setup

GER Member
edited August 6 in Ask the GATK team

Hi,
In the GATK forum there are many confusing discussions about the difference between read group ID and PU (platform unit) and how to set this up prior to analysis.

The confusion arises because the GATK FAQ and most of the examples assume that only one sample is run on each lane. In many situations, however, multiple samples are run in the same lane (multiplexed). This has led to numerous confusing GATK forum discussions over the past few years.

So to clarify for myself and others who are still unsure, when multiple samples are multiplexed on the same lane, should the reads of each sample in the same lane have:

  • The same or different read group IDs?
  • The same or different read group PUs?

This will affect the BQSR step mainly. PU takes precedence over ID at that step.
So the answers to these questions will determine whether BQSR will use all reads in a lane regardless of the sample (even if multiple samples were sequenced on the same lane) OR whether BQSR will use only reads in a lane from the sample being analyzed.

In other words, should BQSR run per sample-lane, or just per lane?
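
To make the two options concrete, here is a minimal toy sketch in Python. It is not GATK code; it only models the precedence described above (group by PU when it is set, otherwise by ID). The function names, sample names, barcodes, and read group values are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def recalibration_key(read_group):
    """Toy model: use PU when it is set, otherwise fall back to ID."""
    return read_group.get("PU") or read_group["ID"]


def group_samples_by_key(read_groups):
    """Collect the sample names that end up sharing one grouping key."""
    groups = defaultdict(list)
    for rg in read_groups:
        groups[recalibration_key(rg)].append(rg["SM"])
    return dict(groups)


# Hypothetical case 1: two samples multiplexed on one lane, same PU.
shared_pu = [
    {"ID": "rgA", "PU": "H0164.2", "SM": "sampleA"},
    {"ID": "rgB", "PU": "H0164.2", "SM": "sampleB"},
]

# Hypothetical case 2: same lane, but PU also carries a sample barcode.
distinct_pu = [
    {"ID": "H0164.2", "PU": "H0164.2.ACGT", "SM": "sampleA"},
    {"ID": "H0164.2", "PU": "H0164.2.TTAG", "SM": "sampleB"},
]

per_lane = group_samples_by_key(shared_pu)
per_sample_lane = group_samples_by_key(distinct_pu)

print(per_lane)         # one group holding both samples
print(per_sample_lane)  # one group per sample
```

Under this model, a shared PU pools both samples into a single per-lane group, while distinct PUs give one group per sample-lane.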

Answers

  • Tiffany_at_Broad Cambridge, MA Member, Administrator, Broadie, Moderator admin

    Hello!
    Sorry you've found the previous posts confusing. I will respond with a better answer soon.

  • Tiffany_at_Broad Cambridge, MA Member, Administrator, Broadie, Moderator admin

    Hi @GER
    You have likely already read this doc, which talks about Illumina sequencing.

    In other words, should BQSR run per sample-lane, or just per lane?

    It should run on each read group which is defined by the sample and lane.

    When multiple samples are multiplexed on the same lane, should the reads of each sample in the same lane have:
    The same or different read group IDs?

    Different - In Illumina data, read group IDs are composed using the flowcell barcode, lane ID, and library barcode. However, all reads that are a part of the same library and sequenced on the same lane/flowcell will have the same read group ID.

    The same or different read group PUs?

    The Same - As the doc says, the PU holds three types of information: flowcell barcode (unique ID), lane (lane of flowcell), and sample barcode (sample/library-specific identifier).

  • GER Member

    Thanks for the info. However, this still does not clear up the confusion.
    As described in other documentation, BQSR uses PU when it is defined instead of read group ID.

    Therefore, if PU is defined as 'The Same' for different samples multiplexed on the same lane, then BQSR will run together in one batch on all the data from all the samples that are multiplexed together in one lane.

    But your response said that BQSR should "run on each read group which is defined by the sample and lane", which means that BQSR should run separately for each sample multiplexed within one lane. So your responses are contradictory. On one hand, you state that BQSR should be run separately for each sample within a multiplexed lane. On the other hand, you state that PU should be the same for all samples multiplexed within one lane. These two statements cannot both be true.

    This topic is clearly still confusing for many people, perhaps even for the creators of the gatk pipeline. Are you able to obtain some more clarity on this?

    Thanks again.

  • Tiffany_at_Broad Cambridge, MA Member, Administrator, Broadie, Moderator admin

    Hi @GER

    You are right! I had to clarify this with a few people. I am going to gather a few examples to share here, and I will update the doc I pointed to.
    For Illumina data, PU is different for different samples multiplexed on the same lane, whereas the read group ID is the same for those samples because it only gives us the flowcell barcode and lane ID. For example:
    @RG ID:H0164.2
    Flowcell barcode: H0164
    Lane ID: 2
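
As a rough illustration of the scheme described in the last reply, the sketch below builds ID from the flowcell barcode and lane, and PU from those two fields plus a sample barcode. The helper names, the dot separator inside PU, the barcode `ACGT`, and the sample name are assumptions made for this example; the thread only confirms the `H0164.2` form of the ID.

```python
def illumina_read_group(flowcell_barcode, lane, sample_barcode, sample_name):
    """Build ID and PU from the fields named in the thread (layout assumed)."""
    rg_id = f"{flowcell_barcode}.{lane}"   # e.g. H0164.2, shared by the lane
    pu = f"{rg_id}.{sample_barcode}"       # assumed layout: adds the sample barcode
    return {"ID": rg_id, "PU": pu, "SM": sample_name}


def rg_header_line(fields):
    """Format the fields as a tab-separated SAM @RG header line."""
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())


# Hypothetical sample on lane 2 of flowcell H0164.
rg = illumina_read_group("H0164", 2, "ACGT", "sampleA")
print(rg_header_line(rg))
```

With this layout, two samples multiplexed on the same lane share the same ID but get different PU values, which matches the corrected answer.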
