(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs
In GATK4, the
GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to
GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case only two of them can do the GVCF consolidation step correctly:
GenomicsDBImport is the preferred tool (see detailed instructions below);
CombineGVCFs is provided only as a backup solution for people who cannot use
GenomicsDBImport, which has a few limitations (for example it can only run on diploid data at the moment). We know
CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try
GenomicsDBImport first and only fall back on
CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).
GenomicsDBImport in practice
GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data.
GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.
So if you have a trio of GVCFs your
GenomicsDBImportcommand would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20):
gatk GenomicsDBImport \ -V data/gvcfs/mother.g.vcf \ -V data/gvcfs/father.g.vcf \ -V data/gvcfs/son.g.vcf \ --genomicsdb-workspace-path my_database \ --intervals chr20
That generates a directory called
my_database containing the combined GVCF data for chromosome 20. The contents of the directory are not really human-readable; see further down for tips to deal with that.
Then you run joint genotyping; note the
gendb:// prefix to the database input directory path.
gatk GenotypeGVCFs \ -R data/ref/ref.fasta \ -V gendb://my_database \ -G StandardAnnotation -newQual \ -O test_output.vcf
And that's all there is to it.
You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.
At the moment you can only run
GenomicsDBImporton a single genomic interval (ie max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on one multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this of course.
GenomicsDBImportcannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using CatVariants) or scatter the following steps by chromosome as well.
**If you can't use
GenomicsDBImport for whatever reason, fall back to
CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way. **
Addendum: extracting GVCF data from the GenomicsDB
If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with
SelectVariants as follows:
gatk SelectVariants \ -R data/ref/ref.fasta \ -V gendb://my_database \ -O combined.g.vcf