If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
How should I pre-process data from multiplexed sequencing and multi-library designs?
Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.
However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs (RGID).
At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping and sorting) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done conveniently at the same time that we do the duplicate marking, by running Mark Duplicates on all read group BAM files for a sample at the same time. Then we run Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below for more details.
Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.
Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two sample libraries, sampleA and sampleB, which were each sequenced on two lanes, lane1 and lane2:
These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (eg sampleA_lib1_lane1.fq leading to read group A11, sampleA_lib2_lane1.fq leading to read group A21 and so on).
1. Run initial steps per-readgroup once
Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a read group), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the read groups dictionary entry for guidance.
The example data becomes:
Note that we used to do a first round of marking duplicates here for QC purposes but tool improvements have rendered this obsolete.
2. Merge read groups and mark duplicates per sample (aggregation + dedup)
Once you have pre-processed each read group individually, you merge read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, bur for the sake of efficiency we combine this with the per-sample duplicate marking step (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command).
The example data becomes:
This process of marking duplicates eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).
3. Remaining per-sample pre-processing
Then all that's left to run is base recalibration (BQSR). We used to have another step here called indel realignment, but assuming you're using a modern caller like HaplotypeCaller or Mutect2, you don't need to do it. You would only do so if you were using a locus-based variant caller like MuTect 1 or UnifiedGenotyper, but why would you want to do that? (You really don't)
The example data becomes:
Base recalibration will be applied per-read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is Illumina HiSeq or similar technology).