(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs
In GATK4, the
GenotypeGVCFs tool can only take a single input i.e., 1) a single single-sample GVCF 2) a single multi-sample GVCF created by CombineGVCFs or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to
GenotypeGVCFs. The input samples must possess genotype likelihoods containing the allele produced by HaplotypeCaller with
-ERC GVCF or
Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly:
GenomicsDBImport is the preferred tool (see detailed instructions below);
CombineGVCFs is provided only as a backup solution for people who cannot use
GenomicsDBImport. We know
CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try
GenomicsDBImport first and only fall back on
CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).
GenomicsDBImport in practice
GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v126.96.36.199 and later and stable in v188.8.131.52 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data.
GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.
So if you have a trio of GVCFs your
GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):
gatk GenomicsDBImport \ -V data/gvcfs/mother.g.vcf \ -V data/gvcfs/father.g.vcf \ -V data/gvcfs/son.g.vcf \ --genomicsdb-workspace-path my_database \ --intervals chr20,chr21
That generates a directory called
my_database containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see “extracting GVCF data from a GenomicsDB” to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like
Buffer resized from 178298bytes to 262033 -- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous.
Then you run joint genotyping; note the
gendb:// prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.
gatk GenotypeGVCFs \ -R data/ref/ref.fasta \ -V gendb://my_database \ -newQual \ -O test_output.vcf
And that's all there is to it.
Important limitations and Common “Gotchas”:
You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.
At least one interval must be provided when using
Input GVCFs cannot contain multiple entries for a single genomic position
GenomicsDBImportcannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.
The annotation counts specified in the header MUST BE VALID! If not, you may see an error like
A fatal error has been detected by the Java Runtime Environment [...] SIGSEGVwith mention of a core dump (which may or may not be output depending on your system configuration.) You can check you annotation headers with vcf-validator from VCFtools [https://github.com/vcftools/vcftools]
GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.
If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable TILEDB_DISABLE_FILE_LOCKING=1 before running any GenomicsDB tool. If you don’t, you will likely see an error like
Could not open array genomicsdb_array at workspace:[...]
HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample callsets, MNPs can be inferred from the phase set (
PS) tag in the FORMAT field.
There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues [https://github.com/broadinstitute/gatk/issues?utf8=✓&q=is:issue+is:open+genomicsdb] to see if a fix is in in progress.
If you can't use
GenomicsDBImport for whatever reason, fall back to
CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.
Addendum: extracting GVCF data from the GenomicsDB
If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with
SelectVariants as follows:
gatk SelectVariants \ -R data/ref/ref.fasta \ -V gendb://my_database \ -O combined.g.vcf
Bells and Whistles
GenomicsDB now supports allele-specific annotations [ https://software.broadinstitute.org/gatk/documentation/article?id=9622 ], which have become standard in our Broad exome production pipeline.
GenomicsDB can now import directly from a Google cloud path (i.e.
gs://) using NIO.