What to do if the read group information is not properly available?

Juls Member ✭✭
edited May 22 in Ask the GATK team

Hi,

There are recommendations on how to work with read groups (https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups).
However, I was wondering how to proceed when the read group information is not fully available, for example when I am working with public data. SRA strips or replaces the read names in the FASTQ files, so I basically only have a run, experiment, and biosample ID from SRA. I am aware that working with public data is always difficult, but I am trying to find the best possible way to handle this.

Thanks for your input!
Best,

Answers

  • Juls Member ✭✭

    To add to the above:

    Example from https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups
    @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200

    The SRA run table usually also gives a library name, which would give me LB. The platform is obviously also given. I've used the SRA biosample as the SM value and the SRA run as the lane.
    BUT I do not have the flow cell. What should I use here? Can I state that I don't know?
    I'm not sure whether I can make use of the SRA experiment ID.

    Thanks

  • bshifaw Member, Broadie, Moderator admin

    The official answer to this question is to "ask whoever performed the sequencing or provided the BAM to give you the metadata you need," as mentioned here.

    As mentioned in the doc you linked:
    "These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. "

    If you are not able to obtain this information and decide to fill it in to the best of your ability given the definitions, you run the risk of having artifacts in your results.

  • AdelaideR Member admin

    Hi Juls - The flowcell value is helpful for troubleshooting sequencing errors on the machine. It is an internal check that is not necessary for downstream analysis unless you are working on a new method such as an alternative form of library prep.

    The flowcell information is part of the headers in a fastq file, so you could look for it there, but that may be additional effort that is not required for this downstream analysis.
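
As a rough sketch of what looking for the flowcell in a FASTQ header could involve: assuming the original Illumina read names (Casava 1.8+ style, `@instrument:run:flowcell:lane:tile:x:y`) are still present, the flowcell and lane are the third and fourth colon-separated fields. The function name and the example read names below are hypothetical. For SRA downloads where the names have been replaced, the function returns `None`, which is the situation described in this thread.

```python
from typing import NamedTuple, Optional


class IlluminaReadName(NamedTuple):
    instrument: str
    run_number: str
    flowcell: str
    lane: str


def parse_illumina_read_name(header: str) -> Optional[IlluminaReadName]:
    """Extract instrument, run, flowcell and lane from a FASTQ header line.

    Returns None when the header does not look like a Casava 1.8+ style
    Illumina read name (for example, an SRA-rewritten name).
    """
    if not header.startswith("@"):
        return None
    # Drop the leading "@" and anything after the first whitespace
    # (the read number / filter flag / index comment).
    parts = header[1:].split()
    if not parts:
        return None
    fields = parts[0].split(":")
    # Assumption: exactly seven colon-separated fields in the read name.
    if len(fields) != 7:
        return None
    instrument, run_number, flowcell, lane = fields[:4]
    if not lane.isdigit():
        return None
    return IlluminaReadName(instrument, run_number, flowcell, lane)


# Made-up read names, for illustration only.
original = "@HWI-ST1234:42:C0ABCACXX:3:1101:1234:5678 1:N:0:ATCACG"
sra_style = "@SRR0000001.1 1 length=100"

print(parse_illumina_read_name(original))
print(parse_illumina_read_name(sra_style))
```

If a flowcell and lane are recovered this way, they could be combined into a read group ID of the form `FLOWCELL.LANE`, as in the example from the read groups article.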

  • Juls Member ✭✭

    @bshifaw
    @AdelaideR
    Thanks for your answer!

    "The flowcell value is helpful for troubleshooting sequencing errors on the machine. It is an internal check that is not necessary for downstream analysis unless you are working on a new method such as an alternative form of library prep."

    So I am a bit confused now. I thought it was recommended so that batch effects or biases in the data that might have been introduced at different stages of the sequencing process can be accounted for - especially during MarkDuplicates and BQSR.

    "The flowcell information is part of the headers in a fastq file, so you could look for it there, but that may be additional effort that is not required for this downstream analysis."

    As mentioned above - I am working with public data. SRA strips/replaces the read names from the fastq files so I basically only have a run, experiment and biosample ID from SRA. No flow cell ID - this info is gone.

    "If you are not able to obtain this information and decide to fill it in to the best of your ability given the definitions, you run the risk of having artifacts in your results."

    Do you have any recommendations for this case? I have used it like this for now:

    "The SRA run table usually also gives a library name, which would give me LB. The platform is obviously also given. I've used the SRA biosample as the SM value and the SRA run as the lane."

    @RG ID:FLOWCELL1.SRA-runAcc PL:ILLUMINA LB:library-name SM:SRA-biosampleAcc

    Thanks so much!!
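
The mapping described above can be sketched in Python as follows. This is only an illustration of how the SRA run-table columns could be assembled into a tab-separated `@RG` header line. The function name, the accessions, and the library name are made up, and `FLOWCELL1` is kept as the same placeholder used in the post because the real flowcell ID is unknown.

```python
def build_read_group(run_acc: str,
                     library_name: str,
                     biosample_acc: str,
                     platform: str = "ILLUMINA",
                     flowcell: str = "FLOWCELL1") -> str:
    """Assemble an @RG header line from SRA run-table values.

    Assumptions (not an official GATK recipe):
      - the SRA run accession stands in for the lane,
      - the SRA biosample accession is used as the sample name (SM),
      - `flowcell` is a placeholder when the real flowcell ID is unknown.
    """
    fields = [
        ("ID", f"{flowcell}.{run_acc}"),
        ("PL", platform),
        ("LB", library_name),
        ("SM", biosample_acc),
    ]
    for tag, value in fields:
        # SAM header fields are tab-separated, so values must not be empty
        # or contain tabs or newlines.
        if not value or "\t" in value or "\n" in value:
            raise ValueError(f"invalid value for {tag}: {value!r}")
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Hypothetical accessions and library name, for illustration only.
print(build_read_group("SRR0000001", "library-name", "SAMN00000001"))
```

Using the run accession in the ID keeps each SRA run in its own read group even though the flowcell part is the same placeholder for every run.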

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin
    edited June 13 Accepted Answer

    Hi @Juls

    1) About the flowcell: @AdelaideR is the best person to answer that question.

    2)

    @RG ID:FLOWCELL1.SRA-runAcc PL:ILLUMINA LB:library-name SM:SRA-biosampleAcc

    As mentioned in this doc (https://software.broadinstitute.org/gatk/documentation/article?id=11015), there is no formal definition of read groups, and the doc describes the read group requirements as well as it can. That said, the example you sent looks good to me.

  • Juls Member ✭✭
    edited July 10

    @AdelaideR
    @bhanuGandham
    @bshifaw
    Again, thank you so much for your answers and your help! This forum offers great support.
