For the GATK Best Practices pipeline, the first step is alignment with BWA. I have a question about the read group (@RG) tag for BWA.
This website https://www.ebi.ac.uk/ena/data/view/PRJEB20654 has links to download raw WGS data. Take the first sample on that page as an example: its run accession is ERR1955393. The first lines of the paired raw FASTQ files are as follows:
I also downloaded the corresponding BAM file for this sample. Running samtools view -H on it shows the following RG information:
@RG ID:0 PL:ILLUMINA SM:LP2100024-DNA_B02 PU:HVW2MCCXX:2:none.
Based on this information, should I have written my bwa command as follows?
bwa mem -R "@RG\tID:0\tPL:ILLUMINA\tSM:LP2100024-DNA_B02\tPU:HVW2MCCXX:6:none"
I guess not. I recently received FASTQ files from another company. The first file is named 180729_I410_CL100081320_L1_HUMcxtRAAAB-549_1.fq.gz, and its first line is very simple: @CL100081320L1C001R001_3/1
In this case, knowing only the sample name (HKYD18051117_A) and the read name from the FASTQ file, I cannot generate an @RG tag like the one shown in the BAM file.
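To make the question concrete, here is a minimal Python sketch of the most I can derive from that read name. It rests on several assumptions that are mine, not anything confirmed by the sequencing company: that the read name is laid out as flowcell, then L plus the lane number, then C/R coordinates; that setting both ID and PU to flowcell.lane is acceptable; and that the PL value passed in is correct for the instrument. The function name build_read_group is hypothetical.

```python
import re


def build_read_group(read_name, sample, platform):
    """Build a string for `bwa mem -R` from a read name.

    Assumed read-name layout: @<flowcell>L<lane>C<col>R<row>_<n>/<mate>
    (a guess based on @CL100081320L1C001R001_3/1).
    """
    match = re.match(
        r"@?(?P<flowcell>[A-Za-z0-9]+?)L(?P<lane>\d+)C\d+R\d+", read_name
    )
    if match is None:
        raise ValueError(f"unrecognised read name: {read_name!r}")

    # Assumption: flowcell.lane identifies one read group and one platform unit.
    unit = f"{match.group('flowcell')}.{match.group('lane')}"
    fields = ["@RG", f"ID:{unit}", f"PL:{platform}", f"SM:{sample}", f"PU:{unit}"]

    # bwa mem -R takes a literal backslash followed by "t" between fields,
    # so join with the two-character sequence rather than a real tab.
    return "\\t".join(fields)


# "DNBSEQ" is a placeholder guess for the platform; substitute the real one.
print(build_read_group("@CL100081320L1C001R001_3/1", "HKYD18051117_A", "DNBSEQ"))
# prints: @RG\tID:CL100081320.1\tPL:DNBSEQ\tSM:HKYD18051117_A\tPU:CL100081320.1
```

The printed string is what I would pass, in quotes, to bwa mem -R. It does not match the ID:0 and PU:...:none form seen in the downloaded BAM, which is the source of my confusion.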
So I am confused. If a sequencing company only provides me with FASTQ files, how should I write the bwa mem -R XXX command to start the GATK pipeline? Do I only need to make sure that the @RG tag is unique, or should I write it exactly in the form shown above?