GenomicsDB is a datastore format developed by our collaborators at Intel to store variant call data (where "datastore" = something that we mere mortals can think of as a database, even though IT professionals insist that it's a completely different thing). The long-term vision is to use this datastore format as an alternative to VCF files for storing and working with variant data. For now though, we are only actively using it as a GVCF consolidation tool in the germline joint-calling workflow.
Note that at the moment GenomicsDB only supports diploid data; our Intel collaborators are working on implementing support for non-diploid data, but in the meantime if you need to work with non-diploid data you'll need to use CombineGVCFs instead.
There are currently three supported operations you can do with a GenomicsDB datastore: create a new GenomicsDB datastore from one or more GVCFs, joint-call it, and extract sample data from it.
- Create a new GenomicsDB datastore from one or more GVCFs
- Joint-call samples in a GenomicsDB datastore
- Extract data from a GenomicsDB datastore
1. Create a new GenomicsDB datastore from one or more GVCFs
The goal of this operation is to consolidate a set of GVCFs into a single datastore that GenotypeGVCFs can run on (because GenotypeGVCFs can only take a single input). To do this via GenomicsDB, we use the GenomicsDBImport tool. This tool takes in one or more single-sample GVCFs, imports data over at least one genomic interval (this feature is available in v126.96.36.199 and later and stable in v188.8.131.52 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read directly from the created GenomicsDB and output the final multi-sample VCF.
Here's what a typical command looks like:
gatk-launch GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsDBWorkspace my_database \
    --intervals chr20,chr21
This command generates a directory called my_database containing the combined GVCF data.
Note that the GVCFs can also be passed in as a list or map instead of being enumerated in the command.
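If you would rather stay with plain -V arguments, the shell can generate them for you. The following is a minimal sketch, not part of GATK: build_import_cmd is a hypothetical helper name, it only prints the command for review rather than running it, and it assumes file paths without spaces. It uses only the flags shown in the command above.

```shell
# Hypothetical helper (not part of GATK): print a GenomicsDBImport command
# with one -V argument per GVCF. Assumes paths contain no spaces.
# Usage: build_import_cmd WORKSPACE INTERVALS GVCF [GVCF...]
build_import_cmd() {
    workspace=$1
    intervals=$2
    shift 2
    variant_args=""
    for gvcf in "$@"; do
        variant_args="$variant_args -V $gvcf"
    done
    echo "gatk-launch GenomicsDBImport$variant_args --genomicsDBWorkspace $workspace --intervals $intervals"
}

# Dry run with the three GVCFs from the example above.
build_import_cmd my_database chr20,chr21 \
    data/gvcfs/mother.g.vcf data/gvcfs/father.g.vcf data/gvcfs/son.g.vcf
```

Once the printed command looks right, you can paste it into your terminal or pipe it to sh.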
Note also that at the moment you can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options that help make this reasonably quick. Overall it's much more scalable than the old CombineGVCFs route anyway (sorry, non-diploids!).
2. Joint-call samples in a GenomicsDB datastore
Once you have a GenomicsDB datastore containing GVCF data from one or more samples, you can run GenotypeGVCFs on it to joint-call the samples it contains.
Here's an example command:
gatk-launch GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf
This will produce a multi-sample VCF with all the usual bells and whistles.
Note the gendb:// prefix on the database input directory path. That's the only difference compared to a regular GenotypeGVCFs command, but it's an important one: if you forget the prefix you will get a big fat error.
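Since a missing prefix is an easy mistake to make in scripts, one option is to normalize the path before passing it to -V. This is only a sketch: gendb_uri is a hypothetical helper name, not a GATK feature, and it simply prepends gendb:// when it is absent.

```shell
# Hypothetical helper (not part of GATK): return the workspace path with
# the gendb:// prefix, adding it only if it is not already there.
gendb_uri() {
    case $1 in
        gendb://*) echo "$1" ;;
        *)         echo "gendb://$1" ;;
    esac
}

# Both calls print the same value, so the prefix is never doubled.
gendb_uri my_database
gendb_uri gendb://my_database
```

In a script you could then write -V "$(gendb_uri my_database)" in place of the hand-typed path.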
3. Extract data from a GenomicsDB datastore
If you want to generate a flat multisample GVCF file from a GenomicsDB you created, you can do so with SelectVariants as follows:
gatk-launch SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf
You can use any of the usual SelectVariants modifiers to extract, for example, only a subset of samples or a subset of genomic intervals. This is useful for troubleshooting variant calls when you want to see what the intermediate GVCF looked like, since it's not possible to view the calls in the GenomicsDB itself in a human-readable way.
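As a sketch of that kind of subsetting, the snippet below assembles a command that would pull one sample over one interval, using the standard -sn (sample name) and -L (interval) arguments. The sample name "son", the interval chr20 and the output file name are assumptions made for illustration; check the actual sample names in your GVCFs. The snippet only prints the command so you can review it first.

```shell
# Assumed values for illustration: sample name "son", interval chr20.
sample=son
interval=chr20

# Assemble the SelectVariants command as a string and print it (dry run).
subset_cmd="gatk-launch SelectVariants -R data/ref/ref.fasta -V gendb://my_database -sn $sample -L $interval -O $sample.$interval.g.vcf"
echo "$subset_cmd"
```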