How should I pre-process data from multiplexed sequencing and multi-library designs?

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,283Administrator, Dev admin
edited December 2015 in FAQs

Our Best Practices Pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.

However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs. At that point there are several different valid strategies for implementing the pre-processing workflow. At the Broad Institute, we run the initial steps of the pre-processing workflow (mapping and sorting) separately on each individual read group, then we merge the data to produce a single BAM file for each sample (aggregation). Then we re-run Mark Duplicates, run Indel Realignment and run Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below and this presentation for more details.

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.


Let's say we have this example data:

  • sample1_lane1.fq
  • sample1_lane2.fq
  • sample2_lane1.fq
  • sample2_lane2.fq

1. Run initial steps per-lane once

Assuming that you received one FASTQ file per lane of sequence data, run each file through mapping, sorting and marking duplicates (dedup).

During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the GATK Dictionary on read groups got guidance.

The example data becomes:

  • sample1_lane1.dedup.bam
  • sample1_lane2.dedup.bam
  • sample2_lane1.dedup.bam
  • sample2_lane2.dedup.bam

Technically this first run of marking duplicates is not necessary because we will run it a second time per-sample, and that per-sample would be enough to achieve the desired result. However, running it once per-lane allows us to estimate library complexity as a quality control step.

2. Merge lanes per sample (aggregation)

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.

The example data becomes:

  • sample1.merged.bam
  • sample2.merged.bam

3. Per-sample pre-processing

At this point you perform an extra round of marking duplicates. This eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).

The example data becomes:

  • sample1.merged.dedup.bam
  • sample2.merged.dedup.bam

Then you run indel realignment and base recalibration (BQSR).

Realigning per-sample means that you will have consistent alignments across all lanes within a sample.

Base recalibration will be applied per-lane (as it should be) if you assigned appropriate read group information in your data.

The example data becomes:

  • sample1.merged.dedup.realn.recal.bam
  • sample2.merged.dedup.realn.recal.bam

This works because the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine.

Note that BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID).

Finally, people often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD


  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,283Administrator, Dev admin

    Questions and comments up to August 2014 have been moved to an archival thread here:

    Geraldine Van der Auwera, PhD

  • zillurbmb51zillurbmb51 USAPosts: 15Member

    I have aligned paired end fast files using BOWTIE2, SAMTOOLS and BCFTOOLS.(SRR901606_1.fastq and SRR901606_2.fastq). Now I have eg606.sam, eg606.bam, eg606.sorted.bam, eg606.raw.bcf. Now which will be my starting point to improve the quality of my data. Is there any sample command line for GATK? " each pre-processing step individually: map & dedup -> realign -> recal." How can I do this?

  • jburattijburatti Posts: 6Member

    I am trying to identify the best GATK pipeline to use with my multiplexed whole exome data.

    In this topic, you say that it is better to run per-lane: map & dedup -> realign -> recal and then after merging lanes, a second pass of per-sample: dedup -> realign (but no 2nd pass of recal step, because "by definition there is no sense in trying to recalibrate across lanes").
    Okay... it makes sense.
    However, I found in the "1506" best practices slides (
    Per-lane: map & dedup, merge lanes (aggregation), then per-sample: dedup (2nd pass) -> realign -> recal.

    I feel confused now... I need your help please !

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,283Administrator, Dev admin

    Hmm our docs may be slightly out of sync, apologies for the confusion. The two ways are essentially equivalent; in the past we recommended doing realignment twice but not anymore. We'll update this document to reflect that and be in sync with the slides.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,283Administrator, Dev admin

    Edited the doc to reflect how pre-processing is implemented in production at Broad.

    Geraldine Van der Auwera, PhD

  • jburattijburatti Posts: 6Member

    Thank you for your quick response Geraldine !

  • mglclinicalmglclinical USAPosts: 23Member

    Hi @Geraldine_VdAuwera

    Thank you for updating/editing this post. Thank you for emphasizing why Baserecalibration is being applied with in a lane, not across all lanes of a sample. Also, thanks for mentioning that PU takes precedence over ID in read groups for baserecalibration.

    Previously you mentioned here , that it is better to re-align (RTC + IR) per lane. Doing per-lane clean up increases the probability of doing re-alignment in high depth regions, as the amount of data will be less when compared to the data in all-lanes-merged-sample bam file. If per-lane clean up is not done, then those high-depth regions will be skipped by GATK.

    I see that the new guideline (Lane Level : Map + Sort +Dedup ; All-Lanes-Merged-Sample Level : Dedup + Realign + Recalibrate) is released/updated after the GATK v 3.5 release. Does that mean that, with previous GATK3.4 version it is better to Re-Align at lane level as it cannot re-align regions of high-depth in all-lanes-merged-sample bam file, where-as current GATK v3.5 has capability to re-align in high-depth regions in all-lanes-merged-sample bam file ?

    My pipeline currently uses GATK v3.4, and it follows previous Best Practices guideline as follows (Lane Level : Map + Sort + Dedup + Realign + Recalibrate ; All-Lanes-Merged-Sample Level : Sort + Dedup + Realign).

    I have 4 lanes of data per sample (from Nextseq500 instrument) and our exome fastq.gz files can be as large as ~20 million reads in both paired-end reads. Steps Realign + Recalibrate at lane level are time consuming, and skipping these 2 steps at lane level, and including the Baserecalibration step at merged sample level, as according to new recommendation, decreases my overall processing time.

    If I switch to GATK v3.5, would indels will be properly realigned in high-depth regions, even though I skip lane-level-realignment ?


  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,283Administrator, Dev admin

    @mglclinical These recent edits don't reflect any significant change of operation in the tools, just a shift in pipeline implementation philosophy. While indel realignment used to be very important when we were calling variants with UnifiedGenotyper, the HaplotypeCaller's ability to perform its own realignment has reduced the importance of the pre-processing realignment. It is still performed because it can produce benefits for base recalibration, but now we're satisfied with a best-effort pass per-sample, as opposed to the more thorough pass that can be achieved per-lane.

    So as long as you're calling variants with HaplotypeCaller, you can follow the new recommendations with either version without concern.

    Geraldine Van der Auwera, PhD

  • mglclinicalmglclinical USAPosts: 23Member

    @Geraldine_VdAuwera thanks for the quick clarification

Sign In or Register to comment.