Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Questions regarding adding @RG information in Illumina CASAVA Mapped BAM

ygongygong Member
edited November 2013 in Ask the GATK team

Dear GATK Team,

We received Whole Genome Sequencing data from Illumina and it was mapped by CASAVA.

The problem is the BAM files we have do not have @RG tag.

But we do have the read group information for each sequence. The following is an example:

PROD104_897:1:2201:15530:13154 99 chr1 9999 254 ...

The read name has the following structure:

<instrument>_<fcnumber>:<lane>:<tile>:<xcoord>:<ycoord>

So you can extract the physical lane information by doing something like:

samtools view file.bam | cut -f1 | cut -d: -f1,2

By using Picard Tools, we can add RGSM for each sample. But we can't add lane and flowcell information in this way because you need to split the BAM files into multiple files based on the lane or flowcell information, then add @RG information for each small bam files and then combine them together.

It seems like impossible to do so because we have so many samples and each file is like 110G.

My question is:

  1. Do we have a better way to add @RG information? Or instead of Picard AddOrReplaceReadGroup.jar add the @RG by lane or flowcell, is there a way to manipulate BAM file by sequence, take the read group information, then write @RG tag?

  2. Is it a must for GATK to use the read group information by finding @RG tag? Could I do some modification in GATK code to tell GATK how to read the read group information for this type of data? If so, what part of code should I look into?

Thanks very much.

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    There is no workaround available in GATK to do what you want, and modifying the GATK to do this would be quite complicated. The lanes should have been tagged properly before being merged. If you do not have access to the pre-merge files, I'm afraid you are going to have to choose between cleaning up the mess the hard way (with some kind of script to extract the reads from different lanes to separate files) or giving up on identifying separate lanes and just assign the same lane ID to all reads for a sample. Obviously we don't recommend the latter, because it will compromise the effectiveness of Base Recalibration. But it's your choice...

  • ygongygong Member

    Thank you, Geraldine.

    As you mentioned, I may need to write some script to extract the reads from different lanes to separate files.

    I'm not sure which kind of scripting method can directly read BAM file (if that is your mean). Or there is an specific tool to do this task.

    If there is a way to read BAM file directly, why can't I just read every reads each time and extract the read group information, then write @RG tag back to the read? Is that a better solution?

    Is there really a scripting method can read BAM files directly? To my best knowledge, I can only use samtools view to read it.

    If not, should I convert BAM file into SAM and read SAM file as text format instead?

    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I don't think you need to decompress to SAM upfront -- you could use samtools to read in each line. But then you'll have to write out each modified line to SAM format, then convert back to BAM, yes.

  • ygongygong Member

    Thank you for your help.
    But I checked samtools document and didn't find how to read a single line each time. Could you please help me?

    With this strategy, does that mean we need to:

    for (each line){
    $line=samtools view .....
    $RG=[modify @RG information]
    $line=$line.$RG
    [print to SAM file]
    }

    Will that be very very slow I think? Since each time we need to launch samtools.

  • ygongygong Member

    Thank you for the detailed explanation for my question.

    I will try to figure it out and also looking for better solution.

    I definitely will update my work to contribute to other people who may have the same question.

  • liuqiliuqi 104 Building, 1# Courtyard, Beichen West Road, Chaoyang District, Beijing, ChinaMember

    @ygong , I meet the same problem, can you tell me how you deal with this problem?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @liuqi,

    Let me summarize and expand upon the solutions presented by @ygong and @Geraldine_VdAuwera. This assumes read names represent <instrument>_<fcnumber>:<lane>:<tile>:<xcoord>:<ycoord>.

    First, you can figure out the different read groups you will need. You could run this first part on a subset of the coordinate-sorted full-length BAM, e.g. a single smaller chromosome, as it is highly unlikely that some readgroup wouldn't have coverage on a contig.

    samtools view file.bam | cut -f1 | cut -d: -f1,2 
    

    You can take this one step further to get a shortlist of the read groups from the subset BAM.

    samtools view file.bam | cut -f1 | cut -d: -f1,2 | sort | uniq > list_of_readgroups.txt
    

    If your BAM is query-name-sorted, then the command should be run on the entirety of the BAM but you can omit the sort | step.

    Then grep out each read group from the full-length BAM into a separate file, e.g. with

    samtools view file.bam | grep 'somepattern' > one_readgroup.
    

    I imagine there is a way to then stream these reads into a process that would add a header, read group line and tags as well as convert to BAM. Good luck and I hope we hear from @ygong what their solution was.

  • liuqiliuqi 104 Building, 1# Courtyard, Beichen West Road, Chaoyang District, Beijing, ChinaMember

    Thank you for your reply, @shlee

    I will try to do it. Thanks!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Good luck @liuqi. Also, you may find this thread helpful.

Sign In or Register to comment.