Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode
This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.
This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.
In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in
-ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).
In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.
This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.
This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".
This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.
1. Variant calling
Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option
--emitRefConfidence GVCF, and using the
.g.vcf extension for the output file.
Note that versions older than 3.4 require passing the options
--variant_index_type LINEAR --variant_index_parameter 128000 to set the correct index strategy for the output gVCF.
2. Optional data aggregation step
If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.
3. Joint genotyping
Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.
4. Variant recalibration
Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.
That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.
As always, we look forward to comments and observations from the research community!