Questions about data processing per lane vs per sample

Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)
This discussion was created from comments split from: Which data processing steps should I do per-lane vs. per-sample?.

Comments

  • tcezard, Edinburgh (Member)

    Hi
    I was wondering why you need to do the dedup/realign preprocessing both before and after merging. Isn't it better to do it only after merging?
    It seems to me that all 3 processes (dedup/realign/base recal) can be done after merging, so why not do them only then?
    Cheers
    Tim

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    @tcezard, it's mainly a problem of computational cost. Running these operations on multiple lanes or multiple samples means handling a lot of data simultaneously, which makes the computational cost (which includes both money and time) increase very quickly. It's easier/cheaper to process lanes' worth of data separately, and the results are good enough.

  • tcezard, Edinburgh (Member)

    Hi Geraldine
    Thank you for the clarification.
    This makes sense for base recalibration but not really for duplicate marking and indel realignment, since you advise doing those after merging as well.
    Will indel realignment be faster/cheaper the second time around?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Will indel realignment be faster/cheaper the second time around?

    No, actually the computational difficulty generally increases once you merge the data, and that's the point. In some regions, processing the data from multiple lanes will simply fail (because there is too much data) so the program will skip the region gracefully (as opposed to crashing). So if you haven't previously done per-lane cleanup, the entire region will remain a mess throughout and it's less likely that any good calls will come out of it. However, if you have already done per-lane cleanup (which is less likely to fail in any given spot because it involves less data), then the region will be at least partly cleaned up and you will still have a decent chance of getting decent calls out of it. It's a way of hedging your bets; kind of a Pascal's wager of data pre-processing, if that makes sense.

  • Eugenie (Member)
    edited December 2013

    Dear Geraldine,
    first of all, thank you for pointing me to this article.
    Could you please help me clarify some issues?

    1) Does it make sense to recalibrate across samples that were sequenced on the same lane? It seems that the more data the better for BQSR, but I am not sure how to deal with different samples.
    (I have data from 2 lanes for each sample, but at the same time there are several samples on each lane. Sorry if that's a bit confusing ;) Actually, the most stupid question is whether I have to define a different read group ID for each sample from the same lane, assuming that they will then be processed together during SNP calling.)

    2) Thank you for such an important remark about contrastive projects; I didn't realize that. But it's not clear to me how to do it: is it possible to realign across samples without merging bams (if in the end I want separate bams to proceed with MuTect)?

    Thank you very much for your time and patience

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hi @Eugenie,

    1. I understand what you mean; here at the Broad we also multiplex samples per lane. Typically we de-multiplex the data from each lane before processing. So if you have e.g. three samples (A, B and C), each sequenced on two lanes (1 and 2), you will need to separate the data into six different files: A1, A2, B1, B2, C1, C2. For the data processing steps, you treat each of these as if they were a separate lane of data. For BQSR, it is true that it is better to have as much data together as possible, so if you have very many samples multiplexed per lane this can become a problem, but if it's only a few then you will be fine as long as you run on the whole genome (or exome) rather than per contig (which people sometimes do to go faster). Regarding the read group IDs, just indicate which sample each belongs to, and which lane they were sequenced on. GATK tools will correctly aggregate data per sample in later steps.

    2. Yes, you can do this by using the -nWayOut argument. This will realign across samples but output separate files per sample instead of one merged bam.
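
The de-multiplexing scheme in point 1 can be sketched as follows. This is only an illustration, not the Broad's actual pipeline: the `FLOWCELL` placeholder, the library naming, and the FASTQ file names are all hypothetical. The script prints one `bwa mem` command per sample/lane combination (it does not run them), so each of the six files gets its own read group ID while sharing its sample name (`SM`) with the other lane of the same sample.

```shell
#!/usr/bin/env bash
# Build a read group header for one sample/lane combination.
# All field values are placeholders chosen for illustration.
rg_header() {
    local sample="$1" lane="$2"
    printf '@RG\\tID:%s.L%s\\tSM:%s\\tPU:FLOWCELL.%s\\tLB:lib_%s\\tPL:ILLUMINA' \
        "$sample" "$lane" "$sample" "$lane" "$sample"
}

# Print (do not run) one alignment command per sample/lane file pair.
print_alignment_commands() {
    local sample lane
    for sample in A B C; do
        for lane in 1 2; do
            printf "bwa mem -M -R '%s' reference.fa %s%s_R1.fq %s%s_R2.fq > %s%s.sam\n" \
                "$(rg_header "$sample" "$lane")" \
                "$sample" "$lane" "$sample" "$lane" "$sample" "$lane"
        done
    done
}

print_alignment_commands
```

Because A1 and A2 carry the same `SM` value but different `ID` values, tools that work per lane can tell them apart, while tools that work per sample can group them.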

  • chunxuan (Member)

    Dear @Geraldine_VdAuwera,

    Thanks for the guide. Just to be clear: after merging, dedup and realignment are recommended, but not recalibration? Is that because recalibration would use far too many resources for too little pay-off?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hi @chunxuan,

    Recalibration is performed at the lane level only because it aims to correct systematic machine errors that are specific to each lane.
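
Putting the answers so far together, the two-pass scheme (dedup, realign and recalibrate per lane; then dedup and realign again per sample, without a second recalibration) might look roughly like this. It is a dry-run sketch that only prints commands, assuming GATK 3.x and Picard command-line syntax. The jar locations, reference, known-sites file, and BAM names are hypothetical, and sorting and indexing steps are left out.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints commands instead of executing them.
GATK='java -jar GenomeAnalysisTK.jar'   # assumed GATK 3.x jar location
PICARD='java -jar picard.jar'           # assumed Picard jar location
REF='reference.fa'                      # hypothetical reference

# Dedup + realign; used per lane and again after combining lanes.
# Usage: dedup_and_realign <output prefix> <input bam> [<input bam> ...]
dedup_and_realign() {
    local out="$1"; shift
    local inputs='' bam
    for bam in "$@"; do inputs="$inputs INPUT=$bam"; done
    echo "$PICARD MarkDuplicates$inputs OUTPUT=$out.dedup.bam METRICS_FILE=$out.metrics.txt"
    echo "$GATK -T RealignerTargetCreator -R $REF -I $out.dedup.bam -o $out.intervals"
    echo "$GATK -T IndelRealigner -R $REF -I $out.dedup.bam -targetIntervals $out.intervals -o $out.realn.bam"
}

# Base recalibration; applied per lane only.
recalibrate() {
    local input="$1" out="$2"
    echo "$GATK -T BaseRecalibrator -R $REF -I $input -knownSites dbsnp.vcf -o $out.recal.table"
    echo "$GATK -T PrintReads -R $REF -I $input -BQSR $out.recal.table -o $out.recal.bam"
}

# Pass 1: each lane of sample A on its own.
for lane in A1 A2; do
    dedup_and_realign "$lane" "$lane.bam"
    recalibrate "$lane.realn.bam" "$lane"
done

# Pass 2: both lanes combined for sample A; no second recalibration.
dedup_and_realign A A1.recal.bam A2.recal.bam
```

Giving MarkDuplicates several `INPUT=` files in pass 2 is one way to combine the lanes; merging them beforehand with another tool would be an equally valid choice.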

  • Eugenie (Member)
    edited January 2014

    Thank you so much for an explicit reply, Geraldine!

    1. Yes, it's WES and around 10 samples per lane, which I was afraid is quite a lot...
    2. Thanks! Sorry, I missed it somehow. Extremely useful option ;)

    But one more question about the tumor/normal realignment. Would you recommend performing the realignment first per lane and then across tumor-normal, or is it better to include one more realignment step (per sample, after merging lanes)? As I understand it, it's always a question of the power needed to handle multiple lanes, as you already explained. So to be on the safe side, is it worth realigning 3 times in such a case? Sorry if this is too specific.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hi @Eugenie,

    For tumor/normal pairs you'll definitely want to do a realignment step on the data from both tumor and normal together. I'm not sure doing three realignment steps is worth it -- I would probably do realignment first per lane and then across tumor-normal. That doesn't represent a crazy amount of data; the power issue mostly affects multisample realignment of larger cohorts, whereas a tumor/normal pair is still fairly small potatoes in comparison.

  • Eugenie (Member)

    Thank you, Geraldine! Yes, seems reasonable. Otherwise I'll be realigning forever ;) And thanks again for the remark concerning contrastive projects.

  • vifehe, Spain (Member)

    Hi Geraldine,

    I probably should have read this post before pre-processing my data, ouch.
    I have WES data, one sample every two lanes, i.e. exo_1 and exo_2 (which I assume are forward and reverse). I merged the lanes when I generated the SAM file by running:

    bwa mem -M -R "Read group info" reference.fa exo1.fq exo2.fq > aligned_exo.sam

    Is that correct? Or should I have pre-processed exo1 and exo2 separately (dedup, realignment and recalibration) and merged them afterwards?

    Thanks in advance!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hi @vifehe,

    I would strongly recommend you check with your sequence data provider to make sure exactly what the two files correspond to. I would not recommend proceeding on assumptions only. Make sure you understand the difference between "lanes" as a unit of sequencing and "forward/reverse" as a property of sequenced reads.
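
As a complement to asking the provider, the read names themselves often show whether two files differ by lane or by read direction. Here is a minimal sketch, assuming the FASTQ files use the Illumina CASAVA 1.8+ header layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`); other header formats would need different parsing. The example header is made up.

```shell
#!/usr/bin/env bash
# Extract the lane and the read number (1 = forward, 2 = reverse)
# from a header line in the assumed CASAVA 1.8+ layout.
describe_header() {
    awk '{
        split($1, name, ":")   # instrument:run:flowcell:lane:tile:x:y
        split($2, info, ":")   # read:filtered:control:index
        printf "lane=%s read=%s\n", name[4], info[1]
    }'
}

# Made-up header for illustration; with real data, something like
#   head -n 1 exo1.fq | describe_header
echo '@SIM:1:FCX:3:15:6329:1045 1:N:0:ATCACG' | describe_header
```

If both files report the same lane but read numbers 1 and 2, they are most likely the two mates of a paired-end run rather than two lanes. In that case passing both files to a single `bwa mem` call, as in the command above, is the usual way to align them.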

  • achenge07, Beijing, CHINA (Member)

    Hi, lovely @Geraldine_VdAuwera,

    First, thanks for your great work!

    My question is:

    If I have 100 tumor samples and 100 normal samples, since you recommend realigning both the tumor and the normal together in general, is it ok for me to put all 200 samples into the realign tool in one step?

    whereas a tumor/normal pair is still fairly small potatoes in comparison.

    Or did you mean 2 samples (tumor and paraneoplastic tissue from the same person) at a time? If so, what should I do if the number of tumor samples differs from the number of normal samples, or if each person has only one sample?

    sorry about my poor English :(

    yaowen

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hi @achenge07,

    I meant that you should process together samples that are matched, e.g. tumor and normal (or other type of tissue) from the same person. If some of your samples are not matched to others, you can process them individually.

  • achenge07, Beijing, CHINA (Member)

    Got it! Thank you Geraldine!

  • Foram (Member)

    We're doing a targeted sequencing project, multiplexing 36-48 samples per run.
    I ran BQSR per sample and it gives some funky results. Do you have any thoughts on whether creating a merged multi-sample file per lane to create the recal table, and then applying said recal table to individual samples would be a good idea?
    Thanks!

  • Zaag (Member ✭✭)

    @Geraldine_VdAuwera said:
    Hi achenge07,

    I meant that you should process together samples that are matched, e.g. tumor and normal (or other type of tissue) from the same person. If some of your samples are not matched to others, you can process them individually.

    Same for a trio or other small families?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hi @Foram,

    Yes, I think that should work. If the funkiness you're seeing is due to not having enough data per sample, then that approach should fix it.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    @Zaag, for trios or small families it is not necessary, but it certainly can't hurt, and may yield some modest benefits.

  • wchen (Member)

    @Geraldine_VdAuwera said:
    Hi achenge07,

    I meant that you should process together samples that are matched, e.g. tumor and normal (or other type of tissue) from the same person. If some of your samples are not matched to others, you can process them individually.

    How would you realign matched tumor and normal together exactly? Example of commands? Thanks!

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @wchen

    Have a look at this article: http://gatkforums.broadinstitute.org/discussion/3060/which-data-processing-steps-should-i-do-per-lane-vs-per-sample/

    Basically, you will merge the tumor and normal files from each individual using samtools merge (http://samtools.sourceforge.net/samtools.shtml). Then, you will dedup and realign those bam files to ensure consistency in the individual.

    I hope this helps.

    -Sheila

  • wchen (Member)

    @Geraldine_VdAuwera said:
    Hi Eugenie,

    1. I understand what you mean; here at the Broad we also multiplex samples per lane. Typically we de-multiplex the data from each lane before processing. So if you have e.g. three samples (A, B and C), each sequenced on two lanes (1 and 2), you will need to separate the data into six different files: A1, A2, B1, B2, C1, C2. For the data processing steps, you treat each of these as if they were a separate lane of data. For BQSR, it is true that it is better to have as much data together as possible, so if you have very many samples multiplexed per lane this can become a problem, but if it's only a few then you will be fine as long as you run on the whole genome (or exome) rather than per contig (which people sometimes do to go faster). Regarding the read group IDs, just indicate which sample each belongs to, and which lane they were sequenced on. GATK tools will correctly aggregate data per sample in later steps.

    2. Yes, you can do this by using the -nWayOut argument. This will realign across samples but output separate files per sample instead of one merged bam.

    @Geraldine_VdAuwera said:
    Hi Eugenie,

    For tumor/normal pairs you'll definitely want to do a realignment step on the data from both tumor and normal together. I'm not sure doing three realignment steps is worth it -- I would probably do realignment first per lane and then across tumor-normal. That doesn't represent a crazy amount of data; the power issue mostly affects multisample realignment of larger cohorts, whereas a tumor/normal pair is still fairly small potatoes in comparison.

    Hi Geraldine,

    So for tumor/normal pairs, you do a first round of processing for each lane separately, then a second round in which you merge the tumor/normal bam files using samtools and realign across tumor-normal for each individual, using -nWayOut to generate separate realigned bam files. Am I right? Thanks!

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @wchen

    Hi,

    You are correct that you first do a round of processing for each lane separately.

    For the second round of processing:

    If you would like a single bam file for each individual including tumor and normal samples, then you can simply input the tumor and normal bam files for each individual into IndelRealigner, and it will output a merged bam file. There is no need to use samtools merge beforehand, because IndelRealigner merges its inputs when it writes the output.

    If you want separate output files for the tumor and normal samples, you can use the -nWayOut argument in IndelRealigner. Please read about the -nWayOut argument here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.html#--nWayOut

    -Sheila
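
The two options above can be sketched as GATK 3.x command lines. This is a dry-run illustration that only prints the commands. The jar location, reference, and BAM names are hypothetical, and the inputs are assumed to be coordinate-sorted, indexed BAMs that have already gone through per-lane processing.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints commands instead of executing them.
GATK='java -jar GenomeAnalysisTK.jar'   # assumed GATK 3.x jar location
REF='reference.fa'                      # hypothetical reference
INPUTS='-I patient1.tumor.bam -I patient1.normal.bam'   # hypothetical BAMs

# Realignment targets are computed once from both BAMs together.
create_targets() {
    echo "$GATK -T RealignerTargetCreator -R $REF $INPUTS -o patient1.intervals"
}

# Option 1: one merged, realigned BAM for the individual.
realign_merged() {
    echo "$GATK -T IndelRealigner -R $REF $INPUTS -targetIntervals patient1.intervals -o patient1.realn.bam"
}

# Option 2: joint realignment with one output BAM per input.
# -nWayOut is used instead of -o; here it is given an output suffix.
realign_separate() {
    echo "$GATK -T IndelRealigner -R $REF $INPUTS -targetIntervals patient1.intervals -nWayOut .realn.bam"
}

create_targets
realign_merged
realign_separate
```

In both options the tumor and normal reads are realigned against the same set of target intervals; the options differ only in how the output is written.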

  • wchen (Member)

    So for tumor/normal, IndelRealigner takes multiple input files and realigns them together, am I right? If so, that clears up all my confusion. Thanks a lot!

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @wchen

    Yes, you are correct.

    -Sheila

  • shangzhong0619, La Jolla (Member)

    Hello,
    I have a question. When I have biological replicates for a sample and each replicate was sequenced on different lanes, should I also merge the biological replicates after merging the lanes? Thanks.

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @shangzhong0619

    Hi,

    Unfortunately, we cannot help you with this. How to process biological replicates depends on your experimental design; both merging and not-merging approaches are potentially valid. It is up to you to make the decision based on why biological replicates were produced in the first place.

    Good luck.

    -Sheila
