What are the differences between Mutect2 and HaplotypeCaller?
They share graph assembly and haplotype determination -- but the similarities end there
Operationally, Mutect2 works similarly to HaplotypeCaller in that they share the active region-based processing, assembly-based haplotype reconstruction and pairHMM alignment of reads to haplotypes. However, they use fundamentally different models for estimating variant likelihoods and genotypes. The HaplotypeCaller model uses ploidy in its genotype likelihood calculations. The Mutect2 model does not. We explain why this is so.
Germline caller versus Somatic caller
The main difference is that HaplotypeCaller is designed to call germline variants, while Mutect2 is designed to call somatic variants. Neither is appropriate for the other use case.
Germline variants are straightforward. They vary against the reference. Germline calling typically assumes a fixed ploidy and calling includes genotyping sites. HaplotypeCaller allows setting a different ploidy than diploid with the
-ploidy argument. HaplotypeCaller can call germline variants on one or multiple samples and the tool can use evidence of variation across the samples to increase confidence in a variant call.
Somatic variants contrast between two samples against the reference. What do we mean by somatic? The Greek word soma refers to parts of an organism other than the reproductive cells. For example, our skin cells are soma-tic and accumulate mutations from sun exposure that presumably our seed or germ cells are protected from. In this example, variants in skin cells that are not variant in the blood cells are somatic.
Mutect2 works primarily by contrasting the presence or absence of evidence for variation between two samples, the tumor and matched normal, from the same individual. The tool can run on unmatched tumors but this produces high rates of false positives. Technically speaking, somatic variants are both (i) different from the control sample and (ii) different from the reference. What this means is that if a site is variant in the control but in the somatic sample reverts to the reference allele, then it is not a somatic variant.
Here are some more specific differences
- Mutect2 is incapable of calculating reference confidence, which is a feature in HaplotypeCaller that is key to producing GVCFs. As a result, there is currently no way to perform joint calling for somatic variant discovery.
- Because a somatic callset is based on a single individual rather than a cohort, annotations in the INFO column of a Mutect2 VCF only refer to the ALT alleles and do not include values for the REF allele. This differs from a germline cohort callset, in which annotations in the INFO field are typically derived from data related to all observed alleles including the reference.
- While HaplotypeCaller relies on a fixed ploidy assumption to calculate the genotype likelihoods that are the basis for genotype probabilities (PL), Mutect2 allows for varying ploidy in the form of allele fractions for each variant. Varying allele fractions are often seen within a tumor sample due to fractional purity, multiple subclones and copy number variation.
- Mutect2 also differs from HaplotypeCaller in that it can apply various prefilters to sites and alleles depending on the use of a matched normal, a panel of normals (PoN) and a common population variant resource containing allele-specific frequencies. If a PoN or matched normal is provided, Mutect2 can use either to filter sites before reassembly, and it can use a germline resource to filter alleles.
- The variant site annotations that HaplotypeCaller and Mutect2 apply by default are very different; see their respective tool documentation for details.
- Finally, Mutect2 has additional parameters not available to HaplotypeCaller. These parameters factor towards the decision to perform reassembly on a region, towards whether to emit a variant and towards whether to filter a site:
- For one, the frequency of alleles not in the germline resource (
--af-of-alleles-not-in-resource) defines in the germline variant prior, which Mutect2 uses in likelihood calculations of a variant being germline.
- Second, the log somatic prior (
--log-somatic-prior) defines the somatic variant prior, which Mutect2 uses in likelihood calculations of a variant being somatic.
- Third, the normal log odds ratio (
--normal-lod) defines the filter threshold for variants in the tumor not being in the normal, i.e. the germline risk factor.
- Fourth, the tumor log odds ratio for emission (
–-tumor-lod-to-emit) defines the cutoff for a tumor variant to appear in a callset.
- For one, the frequency of alleles not in the germline resource (
Historical perspective explains some quirks of somatic calling
Somatic calling is NOT a simple subtraction of control variant alleles from case sample variant alleles. The reason for this stems from the original intent for somatic callsets.
- Somatic calling was originally designed for cancer research--specifically, computational research that focuses on triangulating driver mutation loci in cancer cohorts. Analyses require callsets with high specificity. What this means is that researchers prefer to remove false positives even at the expense of losing some true positives.
- Another consideration is patient privacy. Germline variants, in particular those in untranslated regions or noncoding regions of the genome, deidentify individuals. To protect patient identities, somatic calling was designed to avoid passing on any identifying germline variation.
Somatic callers reflect these two preferences in their stringent filtering, either upfront such that a variant call is not emitted or downstream such that a site is annotated in the FILTER column with the filter name.
A somatic caller should detect low fraction alleles, can make no explicit ploidy assumption and omits genotyping in the traditional sense. Mutect2 adheres to all of these criteria. A number of cancer sample characteristics necessitate such caller features. For one, biopsied tumor samples are commonly contaminated with normal cells, and the normal fraction can be much higher than the tumor fraction of a sample. Second, a tumor can be heterogeneous in its mutations. Third, these mutations not uncommonly include aneuploid events that change the copy number of a cell's genome in patchwork fashion.
A variant allele in the case sample is not called if the site is variant in controls. We explain an exception for GATK4 Mutect2 in a bit.
Historically, somatic callers have called somatic variants at the site-level. That is, if a variant site in the case is also variant in the matched control or in a population resource, e.g. dbSNP, even if the variant allele is different than the control or resource it is discounted from the somatic callset. This practice stems in part from cancer study designs where the control normal sample is sequenced at much lower depth than the case tumor sample. Because of the assumption mutations strike randomly, cancer geneticists view mutations at sites of common germline variation with skepticism. Remember for humans, common germline variant sites occur roughly on average one in a thousand reference bases. So if a commonly variant site accrues additional mutations, we must weigh the chance of it having arisen from a true somatic event or it being something else that will likely not add value to downstream analyses. For most sites and typical analyses, the latter is the case. The variant is unlikely to have arisen from a somatic event and more likely to be some artifact or germline variant, e.g. from mapping or cross-sample contamination.
GATK4 Mutect2 still applies this practice in part. The tool discounts variant sites shared with the panel of normals or with a matched normal control's unambiguously variant site. If the matched normal's variant allele is supported by few reads, at low allele fraction, then the tool accounts for the possibility of the site not being a germline variant.
When it comes to the population germline resource, GATK4 Mutect2 distinguishes between the variant alleles in the germline resource and the case sample. That is, Mutect2 will call a variant site somatic if the allele differs from that in the germline resource. Blog#10911 explains this in a bit more detail and explains how Mutect2 factors germline variant allele frequencies in calling.
Somatic workflows filter case sites with multiple variant alleles. By a similar logic to that outlined above, and with the assumption that common variant sites are biallelic, any site that presents multiple variant alleles in the case sample is suspect. Mutect2 still calls such sites and the contrasting variant alleles; however, in the next step of the workflow, FilterMutectCalls filters such sites with the multiallelic filter. It is possible a multiallelic site in the case sample represents a somatic event, but it is more likely the site is a germline variant site or an artifactual site.
- Tutorial#2801 outlines how to call germline short variants with HaplotypeCaller.
- Tutorial#11136 outlines the GATK4 somatic short variant discovery workflow.
- For differences between GATK4 Mutect2 and GATK3 MuTect2, see Blog#10911.
- HaplotypeCaller tool documentation is here.
- GATK4 Mutect2 tool documentation is here.