
Which data processing steps should I do per-lane vs. per-sample?

Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer
edited September 2013 in FAQs

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Let's say we have this example data:

  • sample1_lane1.fq
  • sample1_lane2.fq
  • sample2_lane1.fq
  • sample2_lane2.fq

1. Run all core steps per-lane once

At the basic level, all pre-processing steps are meant to be performed per-lane. Assuming that you received one FASTQ file per lane of sequence data, just run each file through each pre-processing step individually: map & dedup -> realign -> recal.

The example data becomes:

  • sample1_lane1.dedup.realn.recal.bam
  • sample1_lane2.dedup.realn.recal.bam
  • sample2_lane1.dedup.realn.recal.bam
  • sample2_lane2.dedup.realn.recal.bam
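The naming convention above can be sketched as a small shell helper. This is only an illustration of the per-lane bookkeeping; the function name `per_lane_bam` is hypothetical, and it assumes one `.fq` file per lane, as in the example data.

```shell
#!/bin/sh
# Derive the per-lane output name used in this article from a FASTQ name.
# Each suffix marks one completed step: dedup -> realn -> recal.
# per_lane_bam is a hypothetical helper, not part of any toolkit.
per_lane_bam() {
    base=${1%.fq}                      # strip the .fq extension
    printf '%s.dedup.realn.recal.bam\n' "$base"
}

# Every lane file is processed on its own; nothing is merged at this stage.
for fq in sample1_lane1.fq sample1_lane2.fq sample2_lane1.fq sample2_lane2.fq; do
    printf '%s -> %s\n' "$fq" "$(per_lane_bam "$fq")"
done
```

The point of the loop is that the unit of work is the lane file, not the sample.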

2. Merge lanes per sample

Once you have pre-processed each lane individually, you merge lanes belonging to the same sample into a single BAM file.

The example data becomes:

  • sample1.merged.bam
  • sample2.merged.bam
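One way to do the merge is Picard's MergeSamFiles. The sketch below only prints the command rather than running it. It assumes a `picard.jar` in the working directory (a placeholder path) and that the sample name is whatever precedes `_lane` in the file name, which holds for the example data but may not hold for yours.

```shell
#!/bin/sh
# Sample name = everything before "_lane" in the file name (an assumption
# that matches the example names in this article).
sample_of() {
    name=${1##*/}                      # drop any directory part
    printf '%s\n' "${name%%_lane*}"
}

# Print (do not run) a Picard MergeSamFiles command for one sample.
# picard.jar is a placeholder path.
merge_cmd() {
    sample=$1; shift
    cmd="java -jar picard.jar MergeSamFiles"
    for bam in "$@"; do
        cmd="$cmd INPUT=$bam"          # one INPUT= per lane-level BAM
    done
    printf '%s OUTPUT=%s.merged.bam\n' "$cmd" "$sample"
}

merge_cmd "$(sample_of sample1_lane1.dedup.realn.recal.bam)" \
    sample1_lane1.dedup.realn.recal.bam sample1_lane2.dedup.realn.recal.bam
```

Merging keeps the read groups of the input files, so the lanes stay distinguishable inside the merged BAM.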

3. Per-sample refinement

You can increase the quality of your results by performing an extra round of dedupping and realignment, this time at the sample level. It is not absolutely required and will increase your computational costs, so it's up to you to decide whether you want to do it on your data, but that's how we do it internally at Broad.

The example data becomes:

  • sample1.merged.dedup.realn.bam
  • sample2.merged.dedup.realn.bam

This gets you two big wins:

  • Dedupping per-sample eliminates PCR duplicates across all lanes in addition to optical duplicates (which are by definition only per-lane)
  • Realigning per-sample means that you will have consistent alignments across all lanes within a sample.
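As a rough sketch, the per-sample round could look like the commands printed below. The jar paths, `reference.fa`, and the intermediate file names are placeholders, and the GATK invocations assume the 3.x command-line style (`-T RealignerTargetCreator`, `-T IndelRealigner`). The script only prints the commands so you can review them before running anything.

```shell
#!/bin/sh
# Print (do not run) the per-sample refinement commands for one sample.
# Assumptions: Picard and GATK 3.x jars at these placeholder paths,
# a reference named reference.fa, and merged input <sample>.merged.bam.
refine_cmds() {
    s=$1
    # 1. Mark duplicates across all lanes of the sample.
    printf 'java -jar picard.jar MarkDuplicates INPUT=%s.merged.bam OUTPUT=%s.merged.dedup.bam METRICS_FILE=%s.dedup.metrics CREATE_INDEX=true\n' "$s" "$s" "$s"
    # 2. Find intervals that need realignment.
    printf 'java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I %s.merged.dedup.bam -o %s.intervals\n' "$s" "$s"
    # 3. Realign reads over those intervals.
    printf 'java -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fa -I %s.merged.dedup.bam -targetIntervals %s.intervals -o %s.merged.dedup.realn.bam\n' "$s" "$s" "$s"
}

for sample in sample1 sample2; do
    refine_cmds "$sample"
done
```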

People also often ask whether it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Finally, why not do base recalibration across lanes or across samples? Well, by definition there is no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine. That said, don't worry if you find yourself needing to recalibrate a BAM file with the lanes already merged -- the GATK's BaseRecalibrator is read group-aware, which means that it will identify separate lanes as such even if they are in the same BAM file, and it will always process them separately.
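Read group awareness depends on each lane carrying its own read group when the reads are mapped. A minimal sketch of building one `@RG` line per lane, in the escaped form that `bwa mem -R` accepts, is shown here. The `ID`/`PU` naming scheme, the `ILLUMINA` platform value, and reusing the sample name as the library are illustrative assumptions; use the real values from your sequencing provider.

```shell
#!/bin/sh
# Build a SAM @RG header line for one lane of one sample, with literal
# "\t" separators as expected by bwa mem's -R option.
# The ID/PU scheme, PL value and LB value are assumptions for illustration.
read_group() {
    sample=$1 lane=$2
    printf '@RG\\tID:%s.%s\\tSM:%s\\tPU:%s\\tPL:ILLUMINA\\tLB:%s\n' \
        "$sample" "$lane" "$sample" "$lane" "$sample"
}

# Same sample (SM), different read groups (ID), one per lane.
read_group sample1 lane1
read_group sample1 lane2
```

Because both lanes share `SM:sample1` but have distinct `ID` values, they are aggregated as one sample for calling while remaining separate units for recalibration.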


Geraldine Van der Auwera, PhD

Comments

  • tcezard Posts: 7 Member

    Hi, I was wondering why you need to do the dedup/realign pre-processing both before and after merging. Isn't it better to do it only after merging? It seems to me that all three processes (dedup/realign/base recal) can be done after merging, so why not do them only then? Cheers, Tim

  • Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer

    @tcezard, it's mainly a problem of computational cost. Running these operations on multiple lanes or multiple samples means handling a lot of data simultaneously, which makes the computational cost (which includes both money and time) increase very quickly. It's easier/cheaper to process lanes' worth of data separately, and the results are good enough.

    Geraldine Van der Auwera, PhD

  • tcezard Posts: 7 Member

    Hi Geraldine, thank you for the clarification. This makes sense for base recalibration, but not really for duplicate marking and indel realignment, since you advise doing them after merging as well. Will indel realignment be faster/cheaper the second time around?

  • Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer

    Will indel realignment be faster/cheaper the second time around?

    No, actually the computational difficulty generally increases once you merge the data, and that's the point. In some regions, processing the data from multiple lanes will simply fail (because there is too much data) so the program will skip the region gracefully (as opposed to crashing). So if you haven't previously done per-lane cleanup, the entire region will remain a mess throughout and it's less likely that any good calls will come out of it. However, if you have already done per-lane cleanup (which is less likely to fail in any given spot because it involves less data), then the region will be at least partly cleaned up and you will still have a decent chance of getting decent calls out of it. It's a way of hedging your bets; kind of a Pascal's wager of data pre-processing, if that makes sense.

    Geraldine Van der Auwera, PhD

  • Eugenie Posts: 13 Member
    edited December 2013

    Dear Geraldine, first of all, thank you for pointing me to this article. Could you please help me clarify some issues?

    1) Does it make sense to recalibrate across samples that were sequenced on the same lane? It seems that the more data the better for BQSR, but I am not sure how to deal with different samples. (I have data from 2 lanes for each sample, but at the same time there are several samples on each lane; sorry if that's a bit confusing.) The most basic question is whether I have to define a different read group ID for each sample from the same lane, assuming that they will then be processed together during SNP calling.

    2) Thank you for the important remark about contrastive projects; I didn't realize that. But it's not clear to me how to do it: is it possible to realign across samples without merging BAMs? (In the end I want to have separate BAMs to proceed with MuTect.)

    Thank you very much for your time and patience

  • Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer

    Hi @Eugenie,

    1. I understand what you mean; here at the Broad we also multiplex samples per lane. Typically we de-multiplex the data from each lane before processing. So if you have e.g. three samples (A, B and C), each sequenced on two lanes (1 and 2), you will need to separate the data into six different files: A1, A2, B1, B2, C1, C2. For the data processing steps, you treat each of these as if it were a separate lane of data. For BQSR, it is true that it is better to have as much data together as possible, so if you have very many samples multiplexed per lane this can become a problem, but if it's only a few then you will be fine as long as you run on the whole genome (or exome) rather than per contig (which people sometimes do to go faster). Regarding the read group IDs, just indicate which sample each read group belongs to and which lane it was sequenced on. GATK tools will correctly aggregate data per sample in later steps.

    2. Yes, you can do this by using the -nWayOut argument. This will realign across samples but output separate files per sample instead of one merged bam.
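For readers who want to see the shape of such a run, here is a hedged sketch that prints the two GATK 3.x-style commands for one tumor/normal pair. The jar path, `reference.fa`, `pair.intervals`, and the `.realn.bam` suffix passed to `-nWayOut` are placeholders; check the IndelRealigner documentation for your version to see exactly how the output names are derived from the inputs.

```shell
#!/bin/sh
# Print (do not run) a joint realignment of a matched tumor/normal pair
# that writes one output BAM per input instead of a merged file.
# Assumptions: GATK 3.x jar at a placeholder path, reference.fa,
# and ".realn.bam" as the -nWayOut suffix.
pair_realign_cmds() {
    tumor=$1 normal=$2
    # Targets are computed from both BAMs together.
    printf 'java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I %s -I %s -o pair.intervals\n' "$tumor" "$normal"
    # -nWayOut replaces -o: one realigned file is written per -I input.
    printf 'java -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fa -I %s -I %s -targetIntervals pair.intervals -nWayOut .realn.bam\n' "$tumor" "$normal"
}

pair_realign_cmds tumor.bam normal.bam
```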

    Geraldine Van der Auwera, PhD

  • chunxuan Posts: 14 Member

    Dear @Geraldine_VdAuwera,

    Thanks for the guide. Just to be clear: after merging, dedup and realn are recommended, but not recal? Is that because recal would use far too many resources for too little pay-off?

  • Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer

    Hi @chunxuan,

    Recalibration is performed at the lane level only because it aims to correct systematic machine errors that are specific to each lane.

    Geraldine Van der Auwera, PhD

  • Eugenie Posts: 13 Member
    edited January 21

    Thank you so much for an explicit reply, Geraldine!

    1. Yes, it's WES with around 10 samples per lane, which I was afraid is quite a lot...
    2. Thanks! Sorry, I missed it somehow. Extremely useful option ;)

    But one more question about the tumor/normal realignment. Would you recommend performing the realignment first per lane and then across tumor-normal, or is it better to include one more realignment step (per sample, after merging lanes)? As I understand it, it's always a question of the power needed to handle multiple lanes, as you already explained, so to be on the safe side, is it worth realigning three times in such a case? Sorry if it's too specific.

  • Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer

    Hi @Eugenie,

    For tumor/normal pairs you'll definitely want to do a realignment step on the data from both tumor and normal together. I'm not sure doing three realignment steps is worth it -- I would probably do realignment first per lane and then across tumor-normal. That doesn't represent a crazy amount of data; the power issue mostly affects multisample realignment of larger cohorts, whereas a tumor/normal pair is still fairly small potatoes in comparison.

    Geraldine Van der Auwera, PhD

  • Eugenie Posts: 13 Member

    Thank you, Geraldine! Yes, that seems reasonable. Otherwise I'll realign forever :) And thanks again for the remark concerning contrastive projects.

  • vifehe Spain Posts: 9 Member

    Hi Geraldine,

    Probably I should have read this post before pre-processing my data, ouch. I have WES data, one sample every two lanes, i.e. exo_1 and exo_2 (which I assume are forward and reverse). I merged the lanes when I generated the SAM file by running:

    bwa mem -M -R "Read group info" reference.fa exo1.fq exo2.fq > aligned_exo.sam

    Is that correct? Or should I have pre-processed exo1 and exo2 separately and merged them only after running dedup, realignment and recalibration on each?

    Thanks in advance!

  • Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer

    Hi @vifehe,

    I would strongly recommend you check with your sequence data provider to make sure exactly what the two files correspond to. I would not recommend proceeding on assumptions only. Make sure you understand the difference between "lanes" as a unit of sequencing and "forward/reverse" as a property of sequenced reads.
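As a first sanity check before asking the provider, the FASTQ read headers often carry both pieces of information. The sketch below assumes Illumina-style headers in the CASAVA 1.8+ layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`); files in older or non-Illumina formats will not parse this way, so treat the result as a hint, not a confirmation. The helper name `describe_header` is hypothetical.

```shell
#!/bin/sh
# Read one FASTQ header line on stdin and print its lane and read number.
# Assumes the CASAVA 1.8+ Illumina header layout; describe_header is a
# hypothetical helper name.
describe_header() {
    awk '{
        split($1, id, ":")      # instrument:run:flowcell:lane:tile:x:y
        split($2, tag, ":")     # read:filtered:control:index
        printf "lane=%s read=%s\n", id[4], tag[1]
    }'
}

# Example header in the assumed layout.
printf '%s\n' '@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG' | describe_header

# On real data, something like: head -n 1 exo1.fq | describe_header
```

If two files report the same lane but read numbers 1 and 2, they are most likely the forward and reverse mates of a paired-end run rather than two lanes.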

    Geraldine Van der Auwera, PhD

  • achenge07 Beijing, CHINA Posts: 2 Member

    Hi, lovely @Geraldine_VdAuwera,

    First, thanks for your great work!

    My question is:

    If I have 100 tumor samples and 100 normal samples, since you recommend realigning both the tumor and the normal together in general, is it ok for me to put all 200 samples into the realign tool in one step?

    whereas a tumor/normal pair is still fairly small potatoes in comparison.

    Or did you mean two samples at a time (tumor and paraneoplastic tissue from the same person)? If so, what should I do if the number of tumor samples differs from the number of normal samples, or if each person has only one sample?

    sorry about my poor English :(

    yaowen

  • Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer

    Hi @achenge07,

    I meant that you should process together samples that are matched, e.g. tumor and normal (or other type of tissue) from the same person. If some of your samples are not matched to others, you can process them individually.

    Geraldine Van der Auwera, PhD

  • Foram Posts: 3 Member

    We're doing a targeted sequencing project, multiplexing 36-48 samples per run. I ran BQSR per sample and it gives some funky results. Do you have any thoughts on whether creating a merged multi-sample file per lane to create the recal table, and then applying said recal table to individual samples would be a good idea? Thanks!

  • Zaag Posts: 10 Member

    @Geraldine_VdAuwera said: Hi @achenge07,

    I meant that you should process together samples that are matched, e.g. tumor and normal (or other type of tissue) from the same person. If some of your samples are not matched to others, you can process them individually.

    Does the same apply to a trio or other small families?

  • Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer

    Hi @Foram,

    Yes, I think that should work. If the funkiness you're seeing is due to not having enough data per sample, then that approach should fix it.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera Posts: 5,904 Administrator, GATK Developer

    @Zaag, for trios or small families it is not necessary, but it certainly can't hurt, and may yield some modest benefits.

    Geraldine Van der Auwera, PhD
