Questions regarding adding @RG information in Illumina CASAVA Mapped BAM

Dear GATK Team,
We received Whole Genome Sequencing data from Illumina and it was mapped by CASAVA.
The problem is the BAM files we have do not have @RG tag.
But we do have the read group information for each sequence. The following is an example:
PROD104_897:1:2201:15530:13154 99 chr1 9999 254 ...
The read name has the following structure:
<instrument>_<fcnumber>:<lane>:<tile>:<xcoord>:<ycoord>
So you can extract the physical lane information by doing something like:
samtools view file.bam | cut -f1 | cut -d: -f1,2
By using Picard Tools, we can add RGSM for each sample. But we can't add lane and flowcell information in this way because you need to split the BAM files into multiple files based on the lane or flowcell information, then add @RG information for each small bam files and then combine them together.
It seems like impossible to do so because we have so many samples and each file is like 110G.
My question is:
Do we have a better way to add @RG information? Or instead of Picard AddOrReplaceReadGroup.jar add the @RG by lane or flowcell, is there a way to manipulate BAM file by sequence, take the read group information, then write @RG tag?
Is it a must for GATK to use the read group information by finding @RG tag? Could I do some modification in GATK code to tell GATK how to read the read group information for this type of data? If so, what part of code should I look into?
Thanks very much.
Best Answer
-
Geraldine_VdAuwera Cambridge, MA admin
In the unix shell you can pipe the lines from samtools to serve as input for another program, script or whatever. So as samtools reads in lines, it passes them on for processing. For example, if you do
samtools view -h your_data.bam | grep "HWI-" | sed s/HWI-/TEST-/ > test.txt
this will make samtools read each line and pass it to grep, which will check if it contains the phrase "HWI-" (a common read name starter) and if yes, it will pass that line on to sed, which will perform some kind of pattern substitution and write out the modified line to a given file.
This is just an example of the unix cmdline tool capabilities; you'll need to tweak it depending on what you want to do exactly. You can use just the grep part to write reads to separate pseudo-SAM files per lane and then re-header them and run through AddOrReplaceReadGroups, or you can skip the grep part and do the substitution in place with sed if you're comfortable with regular expressions, and just have a single file to re-header and convert back to BAM.
Note that this is just off the top of my head and there may be better solutions, or there may be major flaws with this approach that I haven't thought of. Since your question falls outside of scope of GATK support I can't really provide you with a full solution. But this should get you started, and others in the community may contribute better solutions.
Answers
Hi there,
There is no workaround available in GATK to do what you want, and modifying the GATK to do this would be quite complicated. The lanes should have been tagged properly before being merged. If you do not have access to the pre-merge files, I'm afraid you are going to have to choose between cleaning up the mess the hard way (with some kind of script to extract the reads from different lanes to separate files) or giving up on identifying separate lanes and just assign the same lane ID to all reads for a sample. Obviously we don't recommend the latter, because it will compromise the effectiveness of Base Recalibration. But it's your choice...
Thank you, Geraldine.
As you mentioned, I may need to write some script to extract the reads from different lanes to separate files.
I'm not sure which kind of scripting method can directly read BAM file (if that is your mean). Or there is an specific tool to do this task.
If there is a way to read BAM file directly, why can't I just read every reads each time and extract the read group information, then write @RG tag back to the read? Is that a better solution?
Is there really a scripting method can read BAM files directly? To my best knowledge, I can only use samtools view to read it.
If not, should I convert BAM file into SAM and read SAM file as text format instead?
Thanks.
I don't think you need to decompress to SAM upfront -- you could use samtools to read in each line. But then you'll have to write out each modified line to SAM format, then convert back to BAM, yes.
Thank you for your help.
But I checked samtools document and didn't find how to read a single line each time. Could you please help me?
With this strategy, does that mean we need to:
for (each line){
$line=samtools view .....
$RG=[modify @RG information]
$line=$line.$RG
[print to SAM file]
}
Will that be very very slow I think? Since each time we need to launch samtools.
In the unix shell you can pipe the lines from samtools to serve as input for another program, script or whatever. So as samtools reads in lines, it passes them on for processing. For example, if you do
this will make samtools read each line and pass it to grep, which will check if it contains the phrase "HWI-" (a common read name starter) and if yes, it will pass that line on to sed, which will perform some kind of pattern substitution and write out the modified line to a given file.
This is just an example of the unix cmdline tool capabilities; you'll need to tweak it depending on what you want to do exactly. You can use just the grep part to write reads to separate pseudo-SAM files per lane and then re-header them and run through AddOrReplaceReadGroups, or you can skip the grep part and do the substitution in place with sed if you're comfortable with regular expressions, and just have a single file to re-header and convert back to BAM.
Note that this is just off the top of my head and there may be better solutions, or there may be major flaws with this approach that I haven't thought of. Since your question falls outside of scope of GATK support I can't really provide you with a full solution. But this should get you started, and others in the community may contribute better solutions.
Thank you for the detailed explanation for my question.
I will try to figure it out and also looking for better solution.
I definitely will update my work to contribute to other people who may have the same question.
@ygong , I meet the same problem, can you tell me how you deal with this problem?
Hi @liuqi,
Let me summarize and expand upon the solutions presented by @ygong and @Geraldine_VdAuwera. This assumes read names represent
<instrument>_<fcnumber>:<lane>:<tile>:<xcoord>:<ycoord>
.First, you can figure out the different read groups you will need. You could run this first part on a subset of the coordinate-sorted full-length BAM, e.g. a single smaller chromosome, as it is highly unlikely that some readgroup wouldn't have coverage on a contig.
You can take this one step further to get a shortlist of the read groups from the subset BAM.
If your BAM is query-name-sorted, then the command should be run on the entirety of the BAM but you can omit the
sort |
step.Then grep out each read group from the full-length BAM into a separate file, e.g. with
I imagine there is a way to then stream these reads into a process that would add a header, read group line and tags as well as convert to BAM. Good luck and I hope we hear from @ygong what their solution was.
Thank you for your reply, @shlee
I will try to do it. Thanks!
Good luck @liuqi. Also, you may find this thread helpful.