A number of GATK tools produce or take in HDF5 format data (1; 2), e.g. CollectReadCounts and CreateReadCountPanelOfNormals in the CNV workflow. Some tools like CollectReadCounts also allow for writing to TSV format, but use HDF5 by default due to some advantages, which we cover below.
In this article, we use CNV workflow data to illustrate features of the HDF5 format. To generate the same data for yourself, see Tutorial#11682. Before jumping into HDF5, we first consider the alternative and more familiar format, TSV, in section 1. Section 2 then goes into the details of HDF5, and section 3 outlines how to navigate HDF5 data using HDFView.
1. TSV data is flat
A TSV file is a text file that contains rows of tab-separated values. For genomics data, a header at the start of the file can hold metadata in rows that each start with a special symbol, e.g. an at symbol (@) or a hash symbol (#). The main data then follows in the body of the file. For example, if we run the tumor.bam file from Tutorial#11136 through CollectFragmentCounts, the tool produces a TSV file containing, in order, @RG rows, a row labeling the columns, and then the table of data. Here’s a snippet of the data that shows the three components.
We see CollectFragmentCounts uses a custom tag (@RG) with a workflow identifier (ID) and the sample name (SM).
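As an illustrative stand-in for the screenshot, here is a mock-up of the three components — the @RG row, the column-label row, and the data table. The ID value, sample name, and counts below are invented for illustration:

```
@RG	ID:GATKCopyNumber	SM:sample1
CONTIG	START	END	COUNT
chr1	1000001	1001000	27
chr1	1001001	1002000	34
chr1	1002001	1003000	31
```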
2. HDF5 format is multidimensional
HDF stands for Hierarchical Data Format and 5 denotes the version. Accordingly, HDF5 format organizes data hierarchically, i.e. aggregates data by dimensions.
HDF5 is a data format that allows storing different types of data in a single file. This may sound similar to how a TSV differentiates metadata and data in the header and the body of a single file. However, HDF5 goes beyond such a simple breakdown and is not text-based. This means you cannot view HDF5 data with a text viewer or with standard command-line text utilities.
An analogy for the HDF5 format is the structure of folders and files on a computer. Consider the CNV panel of normals from Tutorial#11682, cnvponM.pon.hdf5. Think of cnvponM.pon.hdf5 as a folder or directory. This main folder is called the root group, and in it are subdirectories that each hold a dataset, e.g. the raw counts data from the normal samples or the decomposed panel data. In turn, a dataset contains (i) the actual data, e.g. a table of normal samples against their counts for genomic intervals, plus (ii) what biomedical researchers consider metadata, e.g. sample file names for the table rows.
As mentioned, HDF5 groups collections of multi-dimensional arrays. The downstream advantage is that the data sits in a structure that allows for efficient I/O, i.e. fast access by the computer. Moreover, HDF5 data includes metadata, making the file self-describing. The advantage of this is that HDF5 data is easily analyzable from various languages such as R and Python.
To get technical, HDF5 metadata, or attributes, actually refers to information on objects and includes descriptions of the object’s dataspace and datatype. The former describes something called arrayness, while the latter describes individual data elements in a dataset. For example, consider as an object the raw panel of normals counts data. For this data object, the dataspace has two dimensions. The size of the first dimension corresponds to the number of samples in the panel and the second dimension corresponds to the number of genomic intervals. The datatype 64-bit floating-point refers to encoding fractional numerical values in 64-bit binary.
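To make the dataspace and datatype concrete, here is a minimal sketch assuming the h5py package. The file name, group path, and array sizes are invented for illustration — they only mimic the samples-by-intervals layout described above:

```python
# Sketch (assumes h5py): create a tiny HDF5 file and inspect the two
# pieces of object metadata discussed above -- the dataspace (shape)
# and the datatype of each element.
import h5py
import numpy as np

with h5py.File("demo_pon.hdf5", "w") as f:
    # A mock "panel of normals" counts table: 3 samples x 4 genomic
    # intervals, encoded as 64-bit floating-point values.
    counts = np.random.rand(3, 4)
    dset = f.create_dataset("panel/counts", data=counts, dtype="float64")

    # Dataspace: two dimensions -- samples, then intervals.
    print(dset.shape)   # (3, 4)
    # Datatype: how each individual element is encoded.
    print(dset.dtype)   # float64
```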
- For an even more technical introduction to HDF5, watch Quincey Koziol from the HDF Group.
Finally, HDF5 can store data elements in a variety of ways, e.g. contiguous, chunks or compressed chunks. In a contiguous store, datasets are in a single block. Think of chunks as groups of tiles that divvy up a tiled floor into equally sized arrays. Careful selection of chunk size goes hand-in-hand with maximizing I/O performance. To extend the tiled floor analogy, imagine you and a number of helpers can each clean a portion of the floor simultaneously. You would want to chunk the portions to areas large enough to avoid blocking one another and also choose the number of chunks to match the number of people cleaning.
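The storage layouts above can be compared directly in a short sketch, again assuming h5py. The chunk shape here is arbitrary, chosen only to show the mechanism — real chunk sizes are tuned to the access pattern, as the analogy suggests:

```python
# Sketch (assumes h5py): the same array stored contiguously (one block)
# versus in compressed chunks (equally sized tiles).
import h5py
import numpy as np

data = np.arange(24, dtype="float64").reshape(4, 6)

with h5py.File("layout_demo.hdf5", "w") as f:
    # Contiguous: the whole dataset is a single block on disk.
    flat = f.create_dataset("contiguous", data=data)
    # Chunked and gzip-compressed: the array is divvied up into
    # (2, 3) tiles that can be read and written independently.
    tiled = f.create_dataset("chunked", data=data,
                             chunks=(2, 3), compression="gzip")
    print(flat.chunks)   # None -> contiguous storage layout
    print(tiled.chunks)  # (2, 3)
```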
- For more discussion on chunking, see this external blog post by geologist Joe Kington.
3. View HDF5 format data with HDFView
Download the Java browser application HDFView from support.hdfgroup.org. Note there are other ways to view HDF5 format data, e.g. using PyTables or h5py.
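If you prefer the command line to a browser application, the same tree that HDFView displays can be walked programmatically. The sketch below assumes h5py and builds a small mock file first, since the tutorial files are not bundled here; the dataset paths are invented:

```python
# Sketch (assumes h5py): list every group and dataset in an HDF5 file,
# mirroring HDFView's expandable tree panel.
import h5py
import numpy as np

# Build a small mock file to walk.
with h5py.File("walk_demo.hdf5", "w") as f:
    f.create_dataset("counts/values", data=np.zeros((2, 3)))
    f.create_dataset("intervals/names", data=np.arange(3))

def describe(name, obj):
    # h5py calls this once per object with its full path and handle.
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    print(f"{name}: {kind}")

with h5py.File("walk_demo.hdf5", "r") as f:
    f.visititems(describe)
```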
If we open an HDF5 file produced by CollectFragmentCounts, we see the following. The root directory is tumor.counts.hdf5. It contains four datasets, each represented by a manila folder icon. Click on the triangle next to a folder to expand its contents.
Click on an object within a subfolder to view its HDF5 metadata. Above, we see the object metadata for counts>values in the panel to the right. Notice for this object the Storage Layout is CONTIGUOUS.
Double-click on the object to view the actual data. For each data object, a new window will open. Four of the five objects are open in the screenshot below for results from CollectFragmentCounts.
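The programmatic equivalent of double-clicking an object is to slice the dataset, which pulls its values into memory. A minimal sketch, assuming h5py; the path counts/values mirrors the screenshot, but the file and values below are a mock built for illustration:

```python
# Sketch (assumes h5py): read a dataset's values, as HDFView shows
# when you double-click the object.
import h5py
import numpy as np

# Mock file standing in for tumor.counts.hdf5.
with h5py.File("view_demo.hdf5", "w") as f:
    f.create_dataset("counts/values", data=np.arange(6.0).reshape(2, 3))

with h5py.File("view_demo.hdf5", "r") as f:
    values = f["counts/values"][:]  # slice [:] reads the full array
    print(values)
```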