Using GenomicsDBImport to consolidate GVCFs for input to GenotypeGVCFs in GATK4
This article is out of date and has been replaced by https://software.broadinstitute.org/gatk/documentation/article?id=11813
In GATK4, the
GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to
GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case there is only one valid way to do it: with
GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data.
GenotypeGVCFs can then read from the created GenomicsDB directly and output a VCF. Note that GenomicsDBImport does not take two or more same sample GVCFs. You will need to create one GVCF per-sample before running the tool.
Here are example commands to use it:
gatk-launch GenomicsDBImport \ -V data/gvcfs/mother.g.vcf \ -V data/gvcfs/father.g.vcf \ -V data/gvcfs/son.g.vcf \ --genomicsDBWorkspace my_database \ --intervals 20
That generates a directory called
my_database containing the combined gvcf data.
Then you run joint genotyping; note the
gendb:// prefix to the database input directory path.
gatk-launch GenotypeGVCFs \ -R data/ref/ref.fasta \ -V gendb://my_database \ -G StandardAnnotation -newQual \ -O test_output.vcf
And that's all there is to it.
There are three caveats:
You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.
At the moment you can only run
GenomicsDBImporton a single genomic interval (ie max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on one multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this of course.
At the moment GenomicsDB only supports diploid data. The developers of GenomicsDB are working on implementing support for non-diploid data.
If you can't use GenomicsDB for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.
Addendum: extracting data from the GenomicsDB
If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:
gatk-launch SelectVariants \ -R data/ref/ref.fasta \ -V gendb://my_database \ -O combined.g.vcf
Caveat: cannot move database after creation
Currently the GenomicsDB internal code uses the absolute path of the location of the database as part of the data encoding. As a consequence, you cannot move the database to a different location before running GenotypeGVCFs on it. If you do, it will no longer work. This is obviously not desirable, and the development team is looking at options to remediate this.