What is uBAM and why is it better than FASTQ for storing unmapped sequence data?

Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)
edited December 2016 in Frequently Asked Questions

Most sequencing providers generate FASTQ files containing the raw unmapped read sequences, so FASTQ is the most common input format for the mapping step of the pre-processing pipeline. This is not ideal because, among other flaws, FASTQ files cannot store much of the metadata associated with a sequencing run, whereas BAM files can. See this blog post for an overview of the many problems associated with the FASTQ format.

At the Broad Institute, we generate unmapped BAM (uBAM) files directly from the Illumina basecalls in order to keep all metadata in one place, and we do not write the data to FASTQ files at any point. This involves a slightly more complex workflow than is shown in the general Best Practices diagram. See this presentation for more details of how this works.
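
To make the "metadata in one place" point concrete, here is a minimal sketch (not the Broad pipeline itself) of what an unmapped read pair looks like in SAM text form with a read group attached. The function names, read names, and read group values are all hypothetical; only the SAM flag bits, the `@RG` header tags, and the `RG:Z:` record tag come from the SAM format.

```python
from typing import Dict


def read_group_header(rg: Dict[str, str]) -> str:
    """Build an @RG header line from tag/value pairs (ID is written first)."""
    fields = [f"ID:{rg['ID']}"]
    fields += [f"{tag}:{value}" for tag, value in rg.items() if tag != "ID"]
    return "\t".join(["@RG"] + fields)


def unmapped_record(name: str, seq: str, qual: str, rg_id: str, first_in_pair: bool) -> str:
    """Format one read of a pair as an unmapped SAM record tagged with its read group."""
    # 0x1 paired, 0x4 read unmapped, 0x8 mate unmapped, 0x40 first / 0x80 second in pair
    flag = 0x1 | 0x4 | 0x8 | (0x40 if first_in_pair else 0x80)
    # Drop the FASTQ '@' prefix and everything after the first whitespace,
    # because SAM read names cannot contain whitespace.
    bare = name[1:] if name.startswith("@") else name
    qname = bare.split()[0]
    # Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL, then tags
    columns = [qname, str(flag), "*", "0", "0", "*", "*", "0", "0", seq, qual, f"RG:Z:{rg_id}"]
    return "\t".join(columns)


# Hypothetical read group: sample, library, platform, and platform unit travel with the reads.
rg = {"ID": "rg1", "SM": "sample1", "LB": "lib1", "PL": "ILLUMINA", "PU": "flowcell1.lane1"}

lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    read_group_header(rg),
    unmapped_record("@run1:1:1101:100:200 1:N:0:1", "ACGT", "IIII", rg["ID"], True),
    unmapped_record("@run1:1:1101:100:200 2:N:0:1", "TTGA", "IIII", rg["ID"], False),
]
print("\n".join(lines))
```

Both mates end up with the same read name and differ only in their flags, and every record points back to the read group in the header. Note that this sketch simply discards the part of the FASTQ name after the first whitespace. In practice you would use an existing tool rather than hand-written code; Picard's FastqToSam and IlluminaBasecallsToSam are the usual choices.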

In case you're wondering, we still show the FASTQ-based workflow as the default in most of our documentation because it is by far the most commonly used workflow, and we want to keep the documentation accessible for our more novice users.

Comments

  • Brian_Bushnell, Walnut Creek (Member)

    Unmapped bam files are larger than gzipped fastq files. They also contain less information: anything after the first whitespace in a read name is truncated, so any program expecting the original Illumina names will have trouble and will probably treat paired data as single-ended. The read names get mutilated because the SAM format requires read 1 and read 2 to have identical names, even though they originally had different names.

    Gzipped fastq files compress faster and smaller than your so-called ubam files. They decompress faster. And by faster... I mean, it's like twice as fast. Why are you recommending a lossy compression format over a lossless compression format that is twice as fast and smaller?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)
    What can I say, we like to have our metadata attached to the reads from as early on as possible. It helps keep things under control when you're processing a whole genome's worth of data every ten minutes.
  • sklages (Member)

    I'd go for BAM as well, especially when it comes to metadata. File size is not really an issue. And if it takes longer to decompress BAM, who cares? I think samtools now supports multithreaded decompression. On one point I agree: at least the Illumina FASTQ headers should be stored completely, one way or another. That way we could reconstruct the original FASTQ files if needed. PacBio's Sequel stores read data in BAM as well.

    We could invent yet another format... but this would probably be counterproductive.

  • myourshaw, University of Colorado (Member ✭✭)

    The recently published Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines, which applies to clinical laboratories, requires that laboratory, run, and patient identifiers "must be present within the file's metadata, and ... recommends that the identifiers are also present in the file name itself". We find that uBAMs are well suited to meet this requirement; I'm not sure how one could store internal metadata in a FASTQ file.

  • mglclinical, USA (Member)

    @myourshaw, I am also reading this 2018 paper on preserving sample identity inside the files' metadata, and it seems uBAMs are better suited than FASTQ files to satisfy Recommendation #10 in the guidelines.

  • mglclinical, USA (Member)

    @Geraldine_VdAuwera, the original link to the presentation on generating unmapped BAM (uBAM) files directly from the Illumina basecalls is dead. The file at this link (https://software.broadinstitute.org/gatk/events/slides/1506/GATKwr8-A-3-GATK_Best_Practices_and_Broad_pipelines.pdf) seems to have been moved.

    Could you please point me to the correct location for this presentation?
