BWA read-group-tag

Dear all:

The first step of the GATK Best Practices pipeline is alignment with BWA, and I have a question about the read group (@RG) tag passed to BWA.

The page https://www.ebi.ac.uk/ena/data/view/PRJEB20654 has links to download raw WGS data. Take the first sample on that page as an example: its run accession is ERR1955393. The first header lines of the paired raw FASTQ files are as follows:
@ERR1955393.1 HVW2MCCXX:6:2203:1568944:0/1
@ERR1955393.1 HVW2MCCXX:6:2203:1568944:0/2

I also downloaded the corresponding BAM file for this sample. When I ran samtools view -H on it, I found the following @RG line:
@RG ID:0 PL:ILLUMINA SM:LP2100024-DNA_B02 PU:HVW2MCCXX:2:none.

Based on this information, should I have written my bwa command as follows?
bwa mem -R "@RG\tID:0\tPL:ILLUMINA\tSM:LP2100024-DNA_B02\tPU:HVW2MCCXX:6:none"
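For illustration, here is a minimal sketch of how such a string could be assembled from the FASTQ header shown above. It assumes the header has the form "@RUN.N FLOWCELL:LANE:..." as in the ERR1955393 example, and it uses FLOWCELL.LANE for both ID and PU, which is one common convention rather than a requirement. The function name read_group_from_header is hypothetical, and the sample name still has to come from the sequencing provider.

```python
def read_group_from_header(header, sample, library=None, platform="ILLUMINA"):
    """Build a string for `bwa mem -R` from one FASTQ header line.

    Assumes a header like "@ERR1955393.1 HVW2MCCXX:6:2203:1568944:0/1",
    where the part after the space starts with FLOWCELL:LANE.
    """
    # Drop the leading "@" and separate the run-level read name
    # from the instrument-level description.
    _, instrument_part = header.lstrip("@").split(" ", 1)
    flowcell, lane = instrument_part.split(":")[:2]

    # Assumption: one read group per flowcell lane, so ID and PU
    # are both derived from flowcell and lane.
    fields = [
        "@RG",
        f"ID:{flowcell}.{lane}",
        f"PL:{platform}",
        f"PU:{flowcell}.{lane}",
        f"SM:{sample}",
    ]
    if library:
        fields.append(f"LB:{library}")

    # bwa mem expects a literal backslash followed by "t" between fields.
    return "\\t".join(fields)


if __name__ == "__main__":
    print(read_group_from_header(
        "@ERR1955393.1 HVW2MCCXX:6:2203:1568944:0/1",
        sample="LP2100024-DNA_B02",
    ))
```

The returned string can be passed as the single quoted argument to bwa mem -R. Reads from different lanes of the same sample would get different ID values but the same SM value.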

I guess not. I recently received FASTQ files from another company. The first FASTQ file is named 180729_I410_CL100081320_L1_HUMcxtRAAAB-549_1.fq.gz, and its first header line is very simple: @CL100081320L1C001R001_3/1

But the @RG line in the BAM file is quite complicated:
@RG ID:180729_I410_CL100081320_L1_HUMcxtRAAAB-549 PL:ILLUMINA PU:CL100081320_L1 LB:HUMcxtRAAAB SM:HKYD18051117_A

In this case, if I know only the sample name (HKYD18051117_A) and the read label from the FASTQ file, I cannot reconstruct the @RG line shown in the BAM file.
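One observation: in this particular example, the ID, PU, and LB values in the BAM header appear to line up with pieces of the FASTQ file name rather than the read label. The sketch below shows that mapping. It assumes the file name layout DATE_INSTRUMENT_FLOWCELL_LANE_LIBRARY-INDEX_MATE.fq.gz, which is inferred from this single file and may not hold for other providers; the function name read_group_from_filename is hypothetical.

```python
import re


def read_group_from_filename(filename, sample, platform="ILLUMINA"):
    """Build a string for `bwa mem -R` from a FASTQ file name.

    Assumes a name like
    "180729_I410_CL100081320_L1_HUMcxtRAAAB-549_1.fq.gz".
    """
    # Strip the mate number and extension, e.g. "_1.fq.gz".
    stem = re.sub(r"_[12]\.f(ast)?q(\.gz)?$", "", filename)

    # Assumed layout: date, instrument, flowcell, lane, library-index.
    _date, _instrument, flowcell, lane, library_index = stem.split("_", 4)
    library = library_index.rsplit("-", 1)[0]

    fields = [
        "@RG",
        f"ID:{stem}",
        f"PL:{platform}",
        f"PU:{flowcell}_{lane}",
        f"LB:{library}",
        f"SM:{sample}",
    ]
    # bwa mem expects a literal backslash followed by "t" between fields.
    return "\\t".join(fields)


if __name__ == "__main__":
    print(read_group_from_filename(
        "180729_I410_CL100081320_L1_HUMcxtRAAAB-549_1.fq.gz",
        sample="HKYD18051117_A",
    ))
```

With the file name and sample name above, this yields the same ID, PL, PU, LB, and SM values as the @RG line in the BAM file.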

So I am confused. If a sequencing company provides me with only FASTQ files, how should I write the bwa mem -R XXX command to start the GATK pipeline? Do I only need to make sure that the @RG ID is unique, or should I write the @RG line exactly like the one shown above?

Best regards,
Jie
