Lane, Library, Sample and Cohort -- what do they mean and why are they important?

Geraldine_VdAuwera Posts: 8,751 Administrator, GATK Dev admin
edited August 2013 in FAQs

There are four major organizational units for next-generation DNA sequencing processes that are used throughout the GATK documentation:

  • Lane: The basic machine unit for sequencing. The lane reflects the basic independent run of an NGS machine. For Illumina machines, this is the physical sequencing lane.

  • Library: A unit of DNA preparation that at some point is physically pooled together. Multiple lanes can be run from aliquots of the same library. The DNA library is the natural unit that is being sequenced, so its properties carry over to every lane run from it. For example, if the library has limited complexity, then many sequences are duplicated, which results in a high duplication rate across lanes.

  • Sample: A single individual, such as human CEPH NA12878. Multiple libraries with different properties can be constructed from the original sample DNA source. Throughout our documentation, we treat samples as independent individuals whose genome sequence we are attempting to determine. Note that from this perspective, tumor / normal samples are different despite coming from the same individual.

  • Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individuals in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples) we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.
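
As a rough illustration of how the first three units relate, here is a minimal Python sketch that models one lane of data as a read group and renders it as a SAM `@RG` header line. The `ReadGroup` class, the example flowcell and library names, and the `flowcell.lane` naming scheme for `ID` and `PU` are assumptions made for this example (it also assumes one library per lane, with no multiplexing); only the `ID`, `SM`, `LB`, `PU` and `PL` tags themselves come from the SAM format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadGroup:
    """Hypothetical record tying one lane of data to its library and sample."""
    flowcell: str
    lane: int
    library: str
    sample: str
    platform: str = "ILLUMINA"

    @property
    def rg_id(self) -> str:
        # Assumed convention: one read group per flowcell lane.
        return f"{self.flowcell}.{self.lane}"

    def header_line(self) -> str:
        # Build a tab-separated @RG line from the standard SAM tags.
        fields = [
            f"ID:{self.rg_id}",
            f"SM:{self.sample}",
            f"LB:{self.library}",
            f"PU:{self.flowcell}.{self.lane}",
            f"PL:{self.platform}",
        ]
        return "@RG\t" + "\t".join(fields)


# Two lanes run from aliquots of the same library of the same sample:
# the read group IDs differ, while the library and sample are shared.
lane1 = ReadGroup("FLOWCELL1", 1, "NA12878_lib1", "NA12878")
lane2 = ReadGroup("FLOWCELL1", 2, "NA12878_lib1", "NA12878")

for rg in (lane1, lane2):
    print(rg.header_line())
```

In this sketch, a tumor and a normal from the same individual would simply get different `sample` values, in line with the definition of Sample above.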

Note that many GATK commands can be run at the lane level, but will give better results when they see all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost, since running these commands across all of your data simultaneously requires much more computing power. Please see the documentation for each step to understand the best way to group or partition your data for that particular process.
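
To make the idea of grouping or partitioning concrete, the sketch below shows one possible way to collect read groups per library or per sample, and to split a sample list into cohorts of about 50. The helper names (`group_by`, `split_into_cohorts`), the dictionary layout and the sample names are all hypothetical; the sketch does not call any GATK tool and does not say which grouping a given step needs.

```python
from collections import defaultdict


def group_by(read_groups, key):
    """Collect read group IDs under a shared attribute, e.g. 'library' or 'sample'."""
    groups = defaultdict(list)
    for rg in read_groups:
        groups[rg[key]].append(rg["id"])
    return dict(groups)


def split_into_cohorts(samples, size=50):
    """Partition a list of sample names into consecutive cohorts of at most `size`."""
    ordered = sorted(samples)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Made-up read groups: S1 was sequenced on two lanes from one library.
read_groups = [
    {"id": "FC1.1", "library": "libA", "sample": "S1"},
    {"id": "FC1.2", "library": "libA", "sample": "S1"},
    {"id": "FC1.3", "library": "libB", "sample": "S2"},
]

by_sample = group_by(read_groups, "sample")
# {'S1': ['FC1.1', 'FC1.2'], 'S2': ['FC1.3']}

cohorts = split_into_cohorts([f"S{i:03d}" for i in range(120)], size=50)
print([len(c) for c in cohorts])  # [50, 50, 20]
```

Smaller groups are cheaper to process, while larger groups let a command see more of the data at once, which is the trade-off described above.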


Geraldine Van der Auwera, PhD



  • rxy712 Posts: 18 Member

    Thank you for the information, which is helpful! I am wondering where to find the library information to put into the read group. I still do not quite understand what a library is. I have a flowcell ID and a sample ID, but I am not sure whether either is the right one. I know that library information is used by MarkDuplicates, but without that information, would the result change a lot?

    Thank you very much!

  • Geraldine_VdAuwera Posts: 8,751 Administrator, GATK Dev admin

    It depends on your experimental design. Do you know how many libraries were prepared for each sample, and how they were arranged on the flowcells?

    Geraldine Van der Auwera, PhD

  • rxy712 Posts: 18 Member

    Let's say I have paired tumor/normal samples for 10 patients, so there are 20 samples in total. I think in this case the sample ID is the same as the library ID, and I have 20 libraries, right? Thank you!

  • rxy712 Posts: 18 Member

    Added: each sample is sequenced on the same flowcell but on a different lane.

  • Sheila Broad Institute Posts: 2,402 Member, GATK Dev, Broadie, Moderator, DSDE Dev admin


    You can put all the information into proper read groups with the help of this article:

