What is uBAM and why is it better than FASTQ for storing unmapped sequence data?

Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
edited December 2016 in Frequently Asked Questions

Most sequencing providers deliver the raw unmapped read sequences as FASTQ files, so FASTQ is the most common input to the mapping step of the pre-processing pipeline. This is not ideal: among other flaws, FASTQ files cannot store much of the metadata associated with a sequencing run, whereas BAM files can carry that information alongside the reads. See this blog post for an overview of the many problems associated with the FASTQ format.
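To make the metadata point concrete, here is a minimal Python sketch of what an unmapped read looks like as a SAM record, next to the `@RG` header line that carries the run-level metadata. It only formats text and is not how any production tool works; the read group values and the function names are made-up placeholders.

```python
# Sketch only: every read group value below is a made-up placeholder.
READ_GROUP = {
    "ID": "run1.lane1",
    "SM": "sample1",
    "LB": "lib1",
    "PL": "ILLUMINA",
    "PU": "flowcell1.1",
}


def read_group_header(rg):
    """Build the @RG header line that holds run-level metadata."""
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in rg.items())


def unmapped_sam_record(name, seq, qual, rg_id, first_of_pair):
    """Format one read of an unmapped pair as a tab-separated SAM line."""
    # FLAG bits: 0x1 paired, 0x4 read unmapped, 0x8 mate unmapped,
    # 0x40 / 0x80 first / second read of the pair.
    flag = 0x1 | 0x4 | 0x8 | (0x40 if first_of_pair else 0x80)
    # Unmapped reads use placeholder values in all alignment columns.
    fields = [name, str(flag), "*", "0", "0", "*", "*", "0", "0",
              seq, qual, f"RG:Z:{rg_id}"]
    return "\t".join(fields)


if __name__ == "__main__":
    print(read_group_header(READ_GROUP))
    print(unmapped_sam_record("read1", "ACGT", "IIII", READ_GROUP["ID"], True))
    print(unmapped_sam_record("read1", "TTGA", "IIII", READ_GROUP["ID"], False))
```

The `RG:Z:` tag on each read points back to the `@RG` line, which is how sample, library, and flowcell information stays attached to every read. A FASTQ record has no equivalent slot for it.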

At the Broad Institute, we generate unmapped BAM (uBAM) files directly from the Illumina basecalls in order to keep all metadata in one place, and we do not write the data to FASTQ files at any point. This involves a slightly more complex workflow than is shown in the general Best Practices diagram. See this presentation for more details of how this works.

In case you're wondering, we still show the FASTQ-based workflow as the default in most of our documentation because it is by far the most commonly used workflow, and we want to keep the documentation accessible for our more novice users.

Comments

  • Brian_Bushnell (Walnut Creek; Member)

    Unmapped BAM files are larger than gzipped FASTQ files. They also contain less information: anything after the first whitespace in a read name is truncated, so any program expecting the original Illumina names will have trouble and will probably treat paired data as single-ended. The read names are mutilated because the SAM format requires read 1 and read 2 to have identical names, even though they originally had different names.

    Gzipped FASTQ files compress faster and smaller than your so-called uBAM files. They decompress faster. And by faster... I mean, it's like twice as fast. Why are you recommending a lossy compression format over a lossless compression format that is twice as fast and smaller?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
    What can I say, we like to have our metadata attached to the reads from as early on as possible. It helps keep things under control when you're processing a whole genome's worth of data every ten minutes.
  • I'd go for BAM as well, especially when it comes to metadata. File size is not really an issue, and if it takes longer to decompress BAM, who cares? I think samtools now supports multithreaded decompression. I do agree on one point: the Illumina FASTQ headers should at least be stored completely in one way or another, so that we could reconstruct the original FASTQ files if needed. PacBio's Sequel stores read data in BAM as well.

    We could invent yet another format, but that would probably be counterproductive.
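The idea raised in the comments above, keeping the full Illumina header so the original FASTQ can be rebuilt, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header string is a made-up example in the Illumina style, the function names are hypothetical, and the `CO:Z` free-text comment tag is just one possible place to keep the trailing part, not something the thread says any tool does.

```python
def split_illumina_header(header):
    """Split a FASTQ header into (read name, trailing comment)."""
    text = header[1:] if header.startswith("@") else header
    # SAM read names cannot contain whitespace, so only the part
    # before the first space can serve as QNAME.
    name, _, comment = text.partition(" ")
    return name, comment


def comment_tag(comment):
    """Keep the trailing part as an optional SAM tag (assumed: CO:Z)."""
    return f"CO:Z:{comment}"


def rebuild_header(name, comment):
    """Reconstruct the original FASTQ header from the stored pieces."""
    return "@" + name + (" " + comment if comment else "")


if __name__ == "__main__":
    # Made-up header in the Illumina style, not taken from real data.
    original = "@M00001:1:000000000-A1B2C:1:1101:15589:1332 1:N:0:ATCACG"
    name, comment = split_illumina_header(original)
    print(name)
    print(comment_tag(comment))
    print(rebuild_header(name, comment) == original)
```

As long as the trailing part is stored somewhere on the record, the round trip back to the original header is lossless. Dropping it is what makes the conversion lossy.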
