BWA read-group-tag

Dear all:

In the GATK Best Practices pipeline, the first step is alignment with BWA. I have a question about the read group tag passed to BWA.

This website https://www.ebi.ac.uk/ena/data/view/PRJEB20654 has links to download raw WGS data. Take the first sample on this page as an example; its run accession is ERR1955393. The first header lines of its paired FASTQ files are as follows:
@ERR1955393.1 HVW2MCCXX:6:2203:1568944:0/1
@ERR1955393.1 HVW2MCCXX:6:2203:1568944:0/2

I also downloaded the corresponding BAM file for this sample. When I ran samtools view -H on it, I found the following RG information:

Based on this information, should I have written my bwa command as follows?
bwa mem -R "@RG\tID:0\tPL:ILLUMINA\tSM:LP2100024-DNA_B02\tPU:HVW2MCCXX:6:none"
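
To make the question concrete, here is a minimal Python sketch of how such a tag could be assembled from the first FASTQ header. The helper name read_group_from_header, the regular expression, and the choice of flowcell.lane for both ID and PU are assumptions for illustration only; whether this is the right content for the tag is exactly what is being asked.

```python
import re

# Assumed header layout, inferred from the two example lines above:
#   @<run>.<n> <flowcell>:<lane>:<tile>:...
HEADER_RE = re.compile(r"^@\S+\s+(?P<flowcell>[^:\s]+):(?P<lane>\d+):")


def read_group_from_header(header, sample, platform="ILLUMINA"):
    """Build a string for bwa mem -R from one FASTQ header (hypothetical helper)."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"unrecognized FASTQ header: {header!r}")
    flowcell = match.group("flowcell")
    lane = match.group("lane")
    # Assumption: flowcell.lane is used for both ID and PU.
    fields = [
        "@RG",
        f"ID:{flowcell}.{lane}",
        f"PL:{platform}",
        f"SM:{sample}",
        f"PU:{flowcell}.{lane}",
    ]
    # bwa mem expects a literal backslash followed by t between fields,
    # so the separator here is two characters, not a real tab.
    return "\\t".join(fields)


rg = read_group_from_header(
    "@ERR1955393.1 HVW2MCCXX:6:2203:1568944:0/1",
    sample="LP2100024-DNA_B02",
)
print(rg)
```

The printed string could then be passed in quotes as the argument of bwa mem -R.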

I guess not. I recently received FASTQ files from another company. The first file is named 180729_I410_CL100081320_L1_HUMcxtRAAAB-549_1.fq.gz, and its first header line is very simple:
@CL100081320L1C001R001_3/1

But the @RG tag in the BAM file is quite complicated, as follows:
@RG ID:180729_I410_CL100081320_L1_HUMcxtRAAAB-549 PL:ILLUMINA PU:CL100081320_L1 LB:HUMcxtRAAAB SM:HKYD18051117_A

In this case, if I only know the sample name (HKYD18051117_A) and the read header from the FASTQ file, I will not be able to reproduce the @RG tag shown in the BAM file.
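
One observation that may be relevant: apart from SM, every value in the @RG line above also appears in the FASTQ file name. The Python sketch below rebuilds the tag from the file name plus the sample name. The helper read_group_from_filename is hypothetical, and the underscore-separated layout it assumes is inferred from this single file name, so it may not hold for other files or providers.

```python
import os


def read_group_from_filename(path, sample, platform="ILLUMINA"):
    """Derive an @RG string from a FASTQ file name (hypothetical helper).

    Assumed layout, inferred from one example only:
      <date>_<instrument>_<flowcell>_<lane>_<library>-<index>_<mate>.fq.gz
    """
    name = os.path.basename(path)
    for suffix in (".fq.gz", ".fastq.gz", ".fq", ".fastq"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # Drop the trailing mate number so both mates share one read group ID.
    stem, _, mate = name.rpartition("_")
    if mate not in ("1", "2"):
        raise ValueError(f"no mate suffix in file name: {path!r}")
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name layout: {path!r}")
    _date, _instrument, flowcell, lane, library_index = parts
    library = library_index.split("-")[0]
    fields = [
        "@RG",
        f"ID:{stem}",
        f"PL:{platform}",
        f"PU:{flowcell}_{lane}",
        f"LB:{library}",
        f"SM:{sample}",
    ]
    # Literal backslash-t separators, as bwa mem -R expects.
    return "\\t".join(fields)


rg2 = read_group_from_filename(
    "180729_I410_CL100081320_L1_HUMcxtRAAAB-549_1.fq.gz",
    sample="HKYD18051117_A",
)
print(rg2)
```

The sample name still has to be supplied separately, since it appears neither in the file name nor in the read header.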

So I am confused. If a sequencing company simply provides me with FASTQ files, how should I write the bwa mem -R XXX command to start the GATK pipeline? Do I only need to make sure that the @RG tag is unique, or should I write the @RG tag exactly like the one shown above?

Best regards,

