Question about @RG tags

jwhite, MEEI (Member)

One problem I ran into with GATK concerns @RG tags in our BAM files, or rather the lack of them. Is there a way to disable the read group requirement? The software we use to produce BAM files (BWA) does not include RG tags. In practice, this means we can't use the toolkit to replace Samtools unless we retrofit our BAM files.

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    This is addressed in our documentation -- you can add read groups using Picard tools.

  • cidr_download
    edited November 2012

    Also, BWA-0.5.9 and newer can incorporate @RG tags (and the corresponding RG:Z for each record of the BAM file). See the "-r" flag in the usage text below.

    bwa-0.5.9 sampe
    
    Usage:   bwa sampe [options] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq>
    
    Options: -a INT   maximum insert size [500]
             -o INT   maximum occurrences for one end [100000]
             -n INT   maximum hits to output for paired reads [3]
             -N INT   maximum hits to output for discordant pairs [10]
             -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]
             -f FILE  sam file to output results to [stdout]
             -r STR   read group header line such as `@RG\tID:foo\tSM:bar' [null]
             -P       preload index into memory (for base-space reads only)
             -s       disable Smith-Waterman for the unmapped mate
             -A       disable insert size estimate (force -s)
    
    Notes: 1. For SOLiD reads, <in1.fq> corresponds R3 reads and <in2.fq> to F3.
           2. For reads shorter than 30bp, applying a smaller -o is recommended to
              to get a sensible speed at the cost of pairing accuracy.
    
    bwa-0.5.9 samse
    Usage: bwa samse [-n max_occ] [-f out.sam] [-r RG_line] <prefix> <in.sai> <in.fq>
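
A minimal sketch of assembling the string passed to "-r", mirroring the `@RG\tID:foo\tSM:bar' form shown in the usage text above. The helper name `make_rg_line` and the example values are hypothetical, and joining fields with a literal backslash-t (rather than a real tab character) is an assumption taken from that usage line.

```python
def make_rg_line(rg_id, sample, extra=None):
    """Build a read group header line in the form shown by the bwa usage text.

    `extra` is an optional dict of additional two-letter tags (e.g. PL, LB, PU).
    """
    fields = {"ID": rg_id, "SM": sample}
    fields.update(extra or {})
    for key, value in fields.items():
        # A tab inside a value would split the field, so reject it up front.
        if "\t" in str(value) or "\\t" in str(value):
            raise ValueError(f"tab not allowed in {key} value: {value!r}")
    # Join with a literal backslash followed by 't', as in the usage text.
    return "\\t".join(["@RG"] + [f"{k}:{v}" for k, v in fields.items()])


print(make_rg_line("foo", "bar"))  # prints @RG\tID:foo\tSM:bar
```

The resulting string would be quoted on the command line so the shell passes it through unchanged.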
    
  • jwhite, MEEI (Member)
    edited January 2013

    Does a read group represent a sample, the forward or reverse reads, or a lane from a sequencer?
    The bwa sampe command line takes both a forward reads file and a reverse reads file, yet the @RG tag given with -r has a single ID: label. What is the @RG tag supposed to represent?
    J.White

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    You may be interested to watch our presentation on NGS data and terms here:

    http://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk

  • jfarrell (Member)

    I have run into this issue (no RGs in the BAM file). One thing to note is that the reads in one BAM file (aligned and received from a sequencing center) may have been sequenced on different machines and lanes. The GATK pipeline needs that information for modeling errors. For a whole-genome BAM file with one sample, we often found reads from 3-4 lanes, so assigning a single RG to all the reads in a BAM file may not get the best out of the pipeline. For the Illumina platform, the lane and machine information is found in the qname of each read, so a unique read group for each machine and lane can be generated from that. We have a Python script that scans the top of each BAM file to count the unique read groups (machines and lanes), creates the header with the additional RG information, and then adds the appropriate read group to each read. It is well worth the extra work. The different RGs can be spotted on the base recalibration plots for an individual BAM file.
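
The qname-based grouping described above could look roughly like the sketch below. This is not the poster's script. It assumes read names follow the colon-separated layout instrument:run:flowcell:lane:tile:x:y used by newer Illumina software; older naming schemes differ and would need their own parser. The function names and example read names are hypothetical.

```python
from collections import Counter


def read_group_id(qname):
    """Derive a machine+flowcell+lane identifier from one read name.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y.
    """
    fields = qname.split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name layout: {qname!r}")
    instrument, _run, flowcell, lane = fields[:4]
    return f"{instrument}.{flowcell}.{lane}"


def count_read_groups(qnames):
    """Count how many reads fall into each derived read group."""
    return Counter(read_group_id(q) for q in qnames)


# Made-up read names standing in for the first records of a BAM file.
names = [
    "HWI-ST100:12:C0ABCACXX:3:1101:1000:2000",
    "HWI-ST100:12:C0ABCACXX:3:1101:1001:2001",
    "HWI-ST100:12:C0ABCACXX:4:1101:1002:2002",
]
for rg_id, n in sorted(count_read_groups(names).items()):
    print(rg_id, n)
```

Each distinct identifier would then get its own @RG header line, and every read would be tagged with the matching RG:Z value.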

  • TechnicalVault, Cambridge, UK (Member)
    edited February 2013

    The ID field is just a unique identifier for that read group within a SAM or BAM file; it exists only so that the RG:Z tag on each read has something to reference. You should derive any meaningful unique identification from the PU (Platform Unit) tag instead, because @RG IDs and the RG:Z fields that refer to them are often changed when you merge BAMs together. It is also worth bearing in mind that on Illumina machines the unit of work is usually run+lane+barcode, since several samples can be put into one lane at once if you multiplex using barcodes. In answer to your question: you should end up with one @RG tag per run+lane+barcode combination. Forward and reverse reads share that one @RG tag and are distinguished by the 0x40 and 0x80 FLAG bits instead.
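
As a small illustration of the point above, the sketch below builds one platform unit string per run+lane+barcode and uses the 0x40 and 0x80 FLAG bits to tell the two mates apart. The dot-separated PU format, the function names, and the example values are assumptions for illustration only.

```python
FIRST_IN_PAIR = 0x40   # SAM FLAG bit: first segment in the template
SECOND_IN_PAIR = 0x80  # SAM FLAG bit: last segment in the template


def platform_unit(run, lane, barcode):
    """One identifier per run+lane+barcode (the separator is an assumption)."""
    return f"{run}.{lane}.{barcode}"


def mate_label(flag):
    """Classify a read by its FLAG bits, independent of its read group."""
    first = bool(flag & FIRST_IN_PAIR)
    second = bool(flag & SECOND_IN_PAIR)
    if first and not second:
        return "first"
    if second and not first:
        return "second"
    return "unknown"


# Both mates of a pair carry the same read group; only the FLAG differs.
pu = platform_unit("run42", 3, "ACGTAC")
for flag in (99, 147):
    print(pu, flag, mate_label(flag))
```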
