The first step of the GATK Best Practices pipeline is alignment with BWA, and I have a question about the read group (@RG) tag to pass to BWA.
This website https://www.ebi.ac.uk/ena/data/view/PRJEB20654 has links to download raw WGS data. Take the first sample listed there as an example; its run accession is ERR1955393. The first line of each FASTQ file in the raw pair is as follows:
I also downloaded the corresponding BAM file for this sample. Running samtools view -H on it shows the following read group (RG) information:
@RG ID:0 PL:ILLUMINA SM:LP2100024-DNA_B02 PU:HVW2MCCXX:2:none.
Based on this information, should I have written my bwa command as follows?
bwa mem -R "@RG\tID:0\tPL:ILLUMINA\tSM:LP2100024-DNA_B02\tPU:HVW2MCCXX:6:none"
I guess not. I recently received FASTQ files from another company. The first file is named 180729_I410_CL100081320_L1_HUMcxtRAAAB-549_1.fq.gz, and its first line is very simple: @CL100081320L1C001R001_3/1
In this case I know only the sample name (HKYD18051117_A) and the label on the first line of the FASTQ file, so I cannot reconstruct an @RG tag like the one shown in the BAM file above.
So I am confused. If a sequencing company provides me with nothing but FASTQ files, how should I write the bwa mem -R XXX command to start the GATK pipeline? Do I only need to make sure that the @RG tag is unique, or should I write it exactly in the form of the one shown above?
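One possible way to approach this, offered as a sketch rather than a definitive answer: the GATK read group documentation describes ID as identifying a flowcell lane, PU as the flowcell barcode plus lane (plus a sample barcode when one exists), SM as the sample name, and PL as the platform. Under that convention the tag can be derived from the FASTQ read name and does not have to match what the original provider wrote. The Python below makes several assumptions that are not in the post: the regular expression is inferred from the single read name quoted above, the helper names `parse_flowcell_lane` and `build_read_group` are hypothetical, and the platform value `DNBSEQ` is a guess based on the look of the read name, so substitute whatever matches the actual sequencer.

```python
import re

# Assumed layout of the read name: flowcell ID, "L" + lane, then "C.." and "R.."
# coordinates. Inferred from one example only; other instruments will differ.
READ_NAME = re.compile(r"^@?(?P<flowcell>[A-Z]+\d+)L(?P<lane>\d+)C\d+R\d+")


def parse_flowcell_lane(read_name):
    """Return (flowcell, lane) extracted from the first line of a FASTQ record."""
    match = READ_NAME.match(read_name)
    if match is None:
        raise ValueError(f"unrecognised read name: {read_name!r}")
    return match.group("flowcell"), match.group("lane")


def build_read_group(sample, flowcell, lane, platform, library=None):
    """Build the string for bwa mem -R, with a literal backslash-t between fields."""
    unit = f"{flowcell}.{lane}"  # unique per flowcell lane
    fields = ["@RG", f"ID:{unit}", f"PL:{platform}", f"PU:{unit}", f"SM:{sample}"]
    if library is not None:
        fields.append(f"LB:{library}")  # library name, if the provider supplies one
    # bwa mem converts the two-character sequence \t into a real tab itself.
    return "\\t".join(fields)


flowcell, lane = parse_flowcell_lane("@CL100081320L1C001R001_3/1")
# "DNBSEQ" is an assumption about the instrument, not something stated in the post.
print(build_read_group("HKYD18051117_A", flowcell, lane, "DNBSEQ"))
```

The printed string can be pasted, in double quotes, after -R in the bwa mem command. If the same sample was sequenced on several lanes, each lane's FASTQ pair would get its own ID and PU while sharing the same SM.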