Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

How should I pre-process data from multiplexed sequencing and multi-library designs?

Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.

However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs (RGID).

At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping and sorting) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done conveniently at the same time that we do the duplicate marking, by running Mark Duplicates on all read group BAM files for a sample at the same time. Then we run Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below for more details.


Example

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two sample libraries, sampleA and sampleB, which were each sequenced on two lanes, lane1 and lane2:

  • sampleA_lane1.fq
  • sampleA_lane2.fq
  • sampleB_lane1.fq
  • sampleB_lane2.fq

These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (eg sampleA_lib1_lane1.fq leading to read group A11, sampleA_lib2_lane1.fq leading to read group A21 and so on).

1. Run initial steps per-readgroup once

Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a read group), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the read groups dictionary entry for guidance.

The example data becomes:

  • sampleA_rgA1.bam
  • sampleA_rgA2.bam
  • sampleB_rgB1.bam
  • sampleB_rgB2.bam

Note that we used to do a first round of marking duplicates here for QC purposes but tool improvements have rendered this obsolete.

2. Merge read groups and mark duplicates per sample (aggregation + dedup)

Once you have pre-processed each read group individually, you merge read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, bur for the sake of efficiency we combine this with the per-sample duplicate marking step (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command).

The example data becomes:

  • sampleA.merged.dedup.bam
  • sampleB.merged.dedup.bam

This process of marking duplicates eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).

3. Remaining per-sample pre-processing

Then all that's left to run is base recalibration (BQSR). We used to have another step here called indel realignment, but assuming you're using a modern caller like HaplotypeCaller or Mutect2, you don't need to do it. You would only do so if you were using a locus-based variant caller like MuTect 1 or UnifiedGenotyper, but why would you want to do that? (You really don't)

The example data becomes:

  • sample1.merged.dedup.recal.bam
  • sample2.merged.dedup.recal.bam

Base recalibration will be applied per-read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is Illumina HiSeq or similar technology).

Sign In or Register to comment.