Read group ID and PU setup

GER (Member)
edited August 6 in Ask the GATK team

Hi,
In the GATK forum there are many confusing discussions about the difference between read group ID and PU (platform unit) and how to set this up prior to analysis.

The confusion arises because the GATK FAQ and most of the examples assume that only one sample is run on each lane. In many situations, however, multiple samples are run in the same lane (multiplexed). This has led to numerous confusing GATK forum discussions over the past few years.

So to clarify for myself and others who are still unsure, when multiple samples are multiplexed on the same lane, should the reads of each sample in the same lane have:

  • The same or different read group IDs?
  • The same or different read group PUs?

This mainly affects the BQSR step, where PU takes precedence over ID.
So the answers to these questions determine whether BQSR uses all reads in a lane regardless of sample (even if multiple samples were sequenced on that lane) OR only the reads in a lane that come from the sample being analyzed.

In other words, should BQSR run per sample-lane, or just per lane?

Answers

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hello!
    Sorry you've found the previous posts confusing. I will respond with a better answer soon.

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hi @GER
    You have likely already read this doc, which talks about Illumina sequencing.

    In other words, should BQSR run per sample-lane, or just per lane?

    It should run on each read group which is defined by the sample and lane.

    When multiple samples are multiplexed on the same lane, should the reads of each sample in the same lane have:
    The same or different read group IDs?

    Different - In Illumina data, read group IDs are composed using the flowcell barcode, lane ID, and library barcode. However, all reads that are a part of the same library and sequenced on the same lane/flowcell will have the same read group ID.

    The same or different read group PUs?

    The Same - As the doc says, the PU holds three types of information: flowcell barcode (unique ID), lane (lane of flowcell), and sample barcode (sample/library-specific identifier).

  • GER (Member)

    Thanks for the info. However, this still does not clear up the confusion.
    As described in other documentation, BQSR uses PU when it is defined instead of read group ID.

    Therefore, if PU is defined as 'The Same' for different samples multiplexed on the same lane, then BQSR will run together in one batch on all the data from all the samples that are multiplexed together in one lane.

    But your response said that BQSR should "run on each read group which is defined by the sample and lane", which means that BQSR should run separately for each sample multiplexed within one lane. So your responses are contradictory. On one hand, you state that BQSR should be run separately for each sample within a multiplexed lane. But on the other hand you state that PU should be the same for all samples multiplexed within one lane. Both of these statements cannot be true.

    This topic is clearly still confusing for many people, perhaps even for the creators of the GATK pipeline. Are you able to obtain some more clarity on this?

    Thanks again.

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hi @GER

    You are right! I had to clarify this with a few people. I am going to gather a few examples to share here, and I will update the doc I pointed to.
    For Illumina data, PU is different for different samples multiplexed on the same lane, whereas read group ID is the same for those samples because it only encodes the flowcell barcode and lane ID. For example:
    @RG ID:H0164.2
    Flowcell barcode: H0164
    Lane ID: 2
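
As an illustration of the convention described in this post, here is a minimal Python sketch that composes an ID and a PU for two samples multiplexed on one lane. The function names `make_rg_id` and `make_rg_pu` and the two sample barcodes are hypothetical; only the flowcell `H0164` and lane `2` come from the example above. It is not part of GATK or Picard.

```python
def make_rg_id(flowcell, lane):
    """Read group ID as described in the thread: flowcell barcode and lane only."""
    return f"{flowcell}.{lane}"


def make_rg_pu(flowcell, lane, sample_barcode):
    """Platform unit as described in the thread: flowcell, lane, and sample barcode."""
    return f"{flowcell}.{lane}.{sample_barcode}"


# Two samples multiplexed on lane 2 of flowcell H0164.
# The barcodes below are made up for illustration.
barcodes = ["ACGTACGT", "TGCATGCA"]
ids = [make_rg_id("H0164", 2) for _ in barcodes]
pus = [make_rg_pu("H0164", 2, bc) for bc in barcodes]

print(ids)  # both samples share the same ID
print(pus)  # each sample gets its own PU
```

Under this convention the ID alone cannot tell the two samples apart, while the PU can.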

  • GER (Member)

    Thanks. Please let me know once you have an answer.

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hi! That is the answer. What else can I help you with?
    Here are some examples of how PU can look, with the sample barcode appended after the lane.

    @RG ID:HK35N.1 SM:NA12878 LB:Pond-492100 PL:illumina PU:HK35NCCXX160204.1.TGCTGCTG CN:BI DT:2016-02-04T00:00:00-0500

    Exome dual indexed:
    @RG ID:H025U.1 PL:illumina PU:H025UALXX141009.1.TCCTTGGT-GCTGCACT LB:NexPond-359781 PI:0 DT:2014-10-09T00:00:00-0400 SM:NA12878 CN:BI
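
To make the structure of these header lines concrete, the following Python sketch splits an `@RG` line into its tags and then breaks the PU into flowcell, lane, and sample barcode. The helper names `parse_rg_line` and `split_pu` are hypothetical. The sketch assumes whitespace-separated fields as shown in this thread (real SAM headers are tab-delimited, and a tag value containing spaces would not survive this split) and assumes the PU follows the `flowcell.lane.barcode` layout of the examples above.

```python
def parse_rg_line(line):
    """Parse an @RG header line into a dict of tag -> value."""
    fields = line.strip().split()
    if not fields or fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only, so values such as timestamps stay intact.
        key, _, value = field.partition(":")
        tags[key] = value
    return tags


def split_pu(pu):
    """Split a PU of the form flowcell.lane.barcode into its three parts."""
    flowcell, lane, barcode = pu.split(".", 2)
    return {"flowcell": flowcell, "lane": lane, "sample_barcode": barcode}


line = ("@RG ID:HK35N.1 SM:NA12878 LB:Pond-492100 PL:illumina "
        "PU:HK35NCCXX160204.1.TGCTGCTG CN:BI DT:2016-02-04T00:00:00-0500")
tags = parse_rg_line(line)
print(tags["ID"], tags["SM"])
print(split_pu(tags["PU"]))
```

The same `split_pu` call works for the dual-indexed example, where the barcode part is `TCCTTGGT-GCTGCACT`.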

  • GER (Member)
    edited August 23

    I see. But I just want to be sure and to clarify the answer relative to my original question.

    --> When multiple samples are multiplexed on the same lane, should BQSR use all reads in the lane regardless of the sample OR should BQSR use only reads in a lane from the sample being analyzed?

    The document you updated (https://software.broadinstitute.org/gatk/documentation/article.php?id=6472) is still not clear about this. It states that PU takes precedence over Read Group ID for BQSR if it is specified, but it does not say WHETHER PU should be specified.

    The answer to my question above will clarify that. I understand now how Read Group ID and PU are different, but should PU be defined in the first place? That will affect how BQSR is run, and the most important question is what data was BQSR designed and best optimized to take? An entire lane's data, or each sample separately even if multiplexed with other samples in the same lane?

    This is critical to clarify, because PU does NOT have to be defined. So the question is, should it be defined? What subset of data exactly was BQSR designed to process?

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hi @GER -
    I am double-checking with a developer on all your questions next week when he is back from vacation.
    Errors could be correlated with lane or sample, so BQSR should ideally stratify by both sample and lane.

    I'll be in touch next week.

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hi @GER I hope this helps:

    Should PU be defined in the first place (when you have multiplexed data & want to run BQSR)? What data was BQSR designed and best optimized to take?

    Yes, use PU because the RG ID is like a tag, not an identifier. For example, different sample BAMs could have the same RG ID in each separate BAM.

    Divide your samples and run BQSR on each sample individually.

    BQSR will create a different model for different read groups and there is no cross-talk between models.
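
The effect of defining PU can be sketched in a few lines of Python. This is only a toy model of the rule stated in this thread (PU takes precedence over ID when it is present), not GATK's actual implementation. The names `recal_key` and `group_by_recal_key`, the sample names, and the barcodes are all hypothetical.

```python
from collections import defaultdict


def recal_key(rg_tags):
    """Return the PU if it is defined, otherwise fall back to the read group ID."""
    return rg_tags.get("PU") or rg_tags["ID"]


def group_by_recal_key(read_groups):
    """Map each grouping key to the list of samples that would share it."""
    groups = defaultdict(list)
    for rg in read_groups:
        groups[recal_key(rg)].append(rg["SM"])
    return dict(groups)


# Two samples multiplexed on the same lane: same ID, different PU.
with_pu = [
    {"ID": "H0164.2", "SM": "sampleA", "PU": "H0164.2.ACGTACGT"},
    {"ID": "H0164.2", "SM": "sampleB", "PU": "H0164.2.TGCATGCA"},
]
# The same read groups with PU left undefined.
without_pu = [{k: v for k, v in rg.items() if k != "PU"} for rg in with_pu]

print(group_by_recal_key(with_pu))     # one key per sample
print(group_by_recal_key(without_pu))  # both samples fall under the shared ID
```

In this toy model, leaving PU undefined makes the two samples indistinguishable by key, which is consistent with the advice above to define PU for multiplexed data.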

  • GER (Member)

    Thanks. But can you also clarify the answer to this question below?

    "The most important question is what data was BQSR designed and best optimized to take? An entire lane's data, or each sample separately even if multiplexed with other samples in the same lane?"

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    I believe the answer here is each sample separately, since you should divide the samples and run each one through BQSR on its own.
