How should I pre-process data from multiplexed sequencing and multi-library designs?

Geraldine_VdAuwera (Administrator, Dev admin)
edited May 20 in FAQs

Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.

However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs (RGID).

At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping, sorting and marking duplicates) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done by re-running Mark Duplicates, this time on all read group BAM files for a sample at the same time. Then we run Indel Realignment and Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below and this presentation for more details.

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Example

Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two samples, sampleA and sampleB, which were each sequenced on two lanes, lane1 and lane2:

  • sampleA_lane1.fq
  • sampleA_lane2.fq
  • sampleB_lane1.fq
  • sampleB_lane2.fq

These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (eg sampleA_lib1_lane1.fq leading to read group A11, sampleA_lib2_lane1.fq leading to read group A21 and so on).
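For illustration only, the read group information for these four files could look like the following SAM @RG header lines; this is a sketch, and the LB (library) and PU (flowcell/lane) values shown here are placeholders, not part of the original example:

    @RG  ID:A1  SM:sampleA  LB:sampleA_lib1  PU:FLOWCELL1.LANE1  PL:ILLUMINA
    @RG  ID:A2  SM:sampleA  LB:sampleA_lib1  PU:FLOWCELL1.LANE2  PL:ILLUMINA
    @RG  ID:B1  SM:sampleB  LB:sampleB_lib1  PU:FLOWCELL1.LANE1  PL:ILLUMINA
    @RG  ID:B2  SM:sampleB  LB:sampleB_lib1  PU:FLOWCELL1.LANE2  PL:ILLUMINA

Note that SM (sample) is shared by read groups from the same sample, while ID and PU are unique per lane.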

1. Run the initial steps once per read group

Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a read group), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the read groups dictionary entry for guidance.
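As a minimal sketch for one read group, mapping with BWA-MEM (-p for interleaved input, -R to attach the read group) followed by coordinate-sorting with Picard could look like the following; reference.fasta, the picard.jar path and the read group tag values are placeholders, not prescribed names:

    # Map the interleaved FASTQ for read group A1 and attach read group info (tag values are placeholders)
    bwa mem -M -p -R '@RG\tID:A1\tSM:sampleA\tLB:sampleA_lib1\tPU:FLOWCELL1.LANE1\tPL:ILLUMINA' \
        reference.fasta sampleA_lane1.fq > sampleA_rgA1.sam

    # Coordinate-sort and convert to BAM
    java -jar picard.jar SortSam \
        INPUT=sampleA_rgA1.sam OUTPUT=sampleA_rgA1.bam SORT_ORDER=coordinate

The same two commands are then repeated for each of the other read groups.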

The example data becomes:

  • sampleA_rgA1.bam
  • sampleA_rgA2.bam
  • sampleB_rgB1.bam
  • sampleB_rgB2.bam

At this point we mark duplicates in each read group BAM file (dedup), which allows us to estimate the complexity of the corresponding library of origin as a quality control step. This step is optional.
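A sketch of this per-read-group QC dedup with Picard MarkDuplicates (the picard.jar path is a placeholder):

    # Mark duplicates within a single read group; the metrics file gives a library complexity estimate
    java -jar picard.jar MarkDuplicates \
        INPUT=sampleA_rgA1.bam \
        OUTPUT=sampleA_rgA1.dedup.bam \
        METRICS_FILE=sampleA_rgA1.dedup.metrics.txt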

The example data becomes:

  • sampleA_rgA1.dedup.bam
  • sampleA_rgA2.dedup.bam
  • sampleB_rgB1.dedup.bam
  • sampleB_rgB2.dedup.bam

Technically this first run of marking duplicates is not necessary because we will run it again per-sample, and that per-sample marking would be enough to achieve the desired result. To reiterate, we only do this round of marking duplicates for QC purposes.

2. Merge read groups and mark duplicates per sample (aggregation + dedup)

Once you have pre-processed each read group individually, you merge read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, but for the sake of efficiency we combine it with the per-sample duplicate marking step (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command).
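In practice this means passing all of a sample's read group BAMs to a single MarkDuplicates command, which merges and marks duplicates in one pass; a sketch (the picard.jar path is a placeholder):

    # Merge sampleA's read group BAMs and mark duplicates across all of them at once
    java -jar picard.jar MarkDuplicates \
        INPUT=sampleA_rgA1.dedup.bam \
        INPUT=sampleA_rgA2.dedup.bam \
        OUTPUT=sampleA.merged.dedup.bam \
        METRICS_FILE=sampleA.merged.dedup.metrics.txt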

The example data becomes:

  • sampleA.merged.dedup.bam
  • sampleB.merged.dedup.bam

To be clear, this is the round of marking duplicates that matters. It eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).

3. Remaining per-sample pre-processing

Then you run indel realignment (optional) and base recalibration (BQSR).

The example data becomes:

  • sampleA.merged.dedup.(realn).recal.bam
  • sampleB.merged.dedup.(realn).recal.bam

Realigning around indels per-sample leads to consistent alignments across all lanes within a sample. This step is only necessary if you will be using a locus-based variant caller like MuTect 1 or UnifiedGenotyper (for legacy reasons). If you will be using HaplotypeCaller or MuTect2, you do not need to perform indel realignment.
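If you do run indel realignment (GATK 3.x), it is a two-step process; in this sketch, reference.fasta, known_indels.vcf and the GenomeAnalysisTK.jar path are placeholders:

    # 1. Identify intervals that need realignment
    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
        -R reference.fasta -I sampleA.merged.dedup.bam \
        -known known_indels.vcf -o sampleA.realn.intervals

    # 2. Realign reads over those intervals
    java -jar GenomeAnalysisTK.jar -T IndelRealigner \
        -R reference.fasta -I sampleA.merged.dedup.bam \
        -targetIntervals sampleA.realn.intervals \
        -known known_indels.vcf -o sampleA.merged.dedup.realn.bam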

Base recalibration will be applied per read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or by RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly, of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is an Illumina HiSeq or similar technology).
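A sketch of the two BQSR commands (GATK 3.x); reference.fasta, dbsnp.vcf, known_indels.vcf and the jar path are placeholders, and the input is the realigned BAM (or the merged.dedup BAM if you skipped realignment):

    # Build the recalibration model; covariates are computed per read group automatically
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R reference.fasta -I sampleA.merged.dedup.realn.bam \
        -knownSites dbsnp.vcf -knownSites known_indels.vcf \
        -o sampleA.recal.table

    # Apply the model and write the recalibrated BAM
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R reference.fasta -I sampleA.merged.dedup.realn.bam \
        -BQSR sampleA.recal.table -o sampleA.merged.dedup.realn.recal.bam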

People also often ask whether it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normal pairs -- we do generally recommend realigning the tumor and the normal together to avoid slight alignment differences between the two tissue types.


Geraldine Van der Auwera, PhD

Comments

  • Geraldine_VdAuwera (Administrator, Dev admin)

    Questions and comments up to August 2014 have been moved to an archival thread here:

    http://gatkforums.broadinstitute.org/discussion/4557/questions-about-data-processing-per-lane-vs-per-sample

    Geraldine Van der Auwera, PhD

  • zillurbmb51 (USA)

    Hi,
    I have aligned paired-end FASTQ files (SRR901606_1.fastq and SRR901606_2.fastq) using BOWTIE2, SAMTOOLS and BCFTOOLS. Now I have eg606.sam, eg606.bam, eg606.sorted.bam, and eg606.raw.bcf. Which of these should be my starting point to improve the quality of my data? Is there any sample command line for GATK? "each pre-processing step individually: map & dedup -> realign -> recal." How can I do this?

  • jburatti

    Hi,
    I am trying to identify the best GATK pipeline to use with my multiplexed whole exome data.

    In this topic, you say that it is better to run per-lane: map & dedup -> realign -> recal and then after merging lanes, a second pass of per-sample: dedup -> realign (but no 2nd pass of recal step, because "by definition there is no sense in trying to recalibrate across lanes").
    Okay... it makes sense.
    However, I found in the "1506" best practices slides (https://www.broadinstitute.org/gatk/events/slides/1506/GATKwr8-A-3-GATK_Best_Practices_and_Broad_pipelines.pdf):
    Per-lane: map & dedup, merge lanes (aggregation), then per-sample: dedup (2nd pass) -> realign -> recal.

    I feel confused now... I need your help please !

  • Geraldine_VdAuwera (Administrator, Dev admin)

    Hmm our docs may be slightly out of sync, apologies for the confusion. The two ways are essentially equivalent; in the past we recommended doing realignment twice but not anymore. We'll update this document to reflect that and be in sync with the slides.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera (Administrator, Dev admin)

    Edited the doc to reflect how pre-processing is implemented in production at Broad.

    Geraldine Van der Auwera, PhD

  • jburatti

    Thank you for your quick response Geraldine !

  • mglclinical (USA)

    Hi @Geraldine_VdAuwera

    Thank you for updating/editing this post. Thank you for emphasizing why base recalibration is applied within a lane, not across all lanes of a sample. Also, thanks for mentioning that PU takes precedence over ID in read groups for base recalibration.

    Previously you mentioned here that it is better to re-align (RTC + IR) per lane. Doing per-lane clean-up increases the probability of doing re-alignment in high-depth regions, as the amount of data will be less compared to the data in the all-lanes-merged sample BAM file. If per-lane clean-up is not done, then those high-depth regions will be skipped by GATK.

    I see that the new guideline (Lane Level: Map + Sort + Dedup; All-Lanes-Merged-Sample Level: Dedup + Realign + Recalibrate) was released/updated after the GATK v3.5 release. Does that mean that with the previous GATK v3.4 it is better to re-align at lane level because it cannot re-align high-depth regions in the all-lanes-merged sample BAM file, whereas the current GATK v3.5 is capable of re-aligning high-depth regions in the all-lanes-merged sample BAM file?

    My pipeline currently uses GATK v3.4, and it follows the previous Best Practices guideline as follows (Lane Level: Map + Sort + Dedup + Realign + Recalibrate; All-Lanes-Merged-Sample Level: Sort + Dedup + Realign).

    I have 4 lanes of data per sample (from a NextSeq 500 instrument) and our exome fastq.gz files can contain as many as ~20 million reads in each of the paired-end files. The Realign + Recalibrate steps at lane level are time-consuming; skipping these two steps at lane level and including base recalibration at the merged-sample level, as per the new recommendation, decreases my overall processing time.

    If I switch to GATK v3.5, would indels be properly realigned in high-depth regions even though I skip lane-level realignment?

    Thanks,
    mglclinical

  • Geraldine_VdAuwera (Administrator, Dev admin)

    @mglclinical These recent edits don't reflect any significant change of operation in the tools, just a shift in pipeline implementation philosophy. While indel realignment used to be very important when we were calling variants with UnifiedGenotyper, the HaplotypeCaller's ability to perform its own realignment has reduced the importance of the pre-processing realignment. It is still performed because it can produce benefits for base recalibration, but now we're satisfied with a best-effort pass per-sample, as opposed to the more thorough pass that can be achieved per-lane.

    So as long as you're calling variants with HaplotypeCaller, you can follow the new recommendations with either version without concern.

    Geraldine Van der Auwera, PhD

  • mglclinical (USA)

    @Geraldine_VdAuwera thanks for the quick clarification

  • pwaltman (New York, NY)
    edited March 25

    Maybe I'm confused, but with the new shift in philosophy, there seems to be minimal value in aligning, sorting and de-duping the lane-specific fastq's now. Given the current philosophy, it does not appear that there would be much difference between any of the following approaches:
    * GATK best practices: per-lane align, per-lane sort, per-lane de-dup, merged, re-sort, re-dedup, re-align, ....
    * per-lane align, merge, sort, de-dup, re-align, ....
    * concat fastq's, align, merge, sort, de-dup, re-align, ....

    The article above states that the per-lane sorting reduces the complexity of the subsequent analysis, but unlike the previously suggested practice (which was aimed at reducing lane-specific artifacts and improving alignment accuracy), there seems to be little value in the additional computation.

    Am I missing something? If not, would it make sense to modify the article to state that if you are using the UnifiedGenotyper, you should follow the old instructions; otherwise, if you are using the HaplotypeCaller, you can follow the updated instructions, or just follow one of the simpler pipelines that I just listed?

    @Geraldine_VdAuwera said:
    @mglclinical These recent edits don't reflect any significant change of operation in the tools, just a shift in pipeline implementation philosophy. While indel realignment used to be very important when we were calling variants with UnifiedGenotyper, the HaplotypeCaller's ability to perform its own realignment has reduced the importance of the pre-processing realignment. It is still performed because it can produce benefits for base recalibration, but now we're satisfied with a best-effort pass per-sample, as opposed to the more thorough pass that can be achieved per-lane.

    So as long as you're calling variants with HaplotypeCaller, you can follow the new recommendations with either version without concern.


    (A GitHub issue, #750, was filed by Sheila to track this question; it was closed by vdauwera.)
  • Geraldine_VdAuwera (Administrator, Dev admin)

    Hi @pwaltman, sorry for the late reply. You summarize our recommendations above as "per-lane align, per-lane sort, per-lane de-dup, merged, re-sort, re-dedup, re-align, ...." but to be clear in your terms it would be "per-lane align, per-lane sort, per-lane dedup, merge, (no re-sort), re-dedup (actually this step can be used to do the merge too), local realign (to be clear it's not a repeat of the original align), ...". This produces the same core advantages as the previous version but in a more compact and efficient workflow. You're correct that results should be largely equivalent to the two other workflows you propose if I'm reading them correctly (I assume you're not proposing to combine all the fastqs before genome-aligning, since you would lose the ability to distinguish read groups); the main difference is the double dedup, which we do for QC reasons (to get per-lane library complexity metrics).

    Geraldine Van der Auwera, PhD

  • medhat (Poland)

    Thanks for this post. What should I do for a sample that has a number of libraries, where each library has a number of lanes? Should it be processed per library within each sample, so that one sample ends up with a number of BAM files, one for each library?

  • Sheila (Broad Institute, Moderator)

    @medhat
    Hi,

    I think this post will help you.

    -Sheila
