Somatic calling is NOT simply a difference between two callsets
To better understand what somatic calling entails, we contrast it with germline calling. We delve into the technical details of what is the same and what differs between Mutect2 and HaplotypeCaller, a somatic caller and a germline caller, respectively. We also provide some historical context that explains some quirks of somatic calling.
Mutect2 and HaplotypeCaller share graph assembly and haplotype determination -- but similarities end there.
Operationally, Mutect2 works similarly to HaplotypeCaller in that they share the active region-based processing, assembly-based haplotype reconstruction and pairHMM alignment of reads to haplotypes. However, they use fundamentally different models for estimating variant likelihoods and genotypes. The HaplotypeCaller model uses ploidy in its genotype likelihood calculations. The Mutect2 model does not.
HaplotypeCaller is designed to call germline variants, while Mutect2 is designed to call somatic variants.
The main difference is that HaplotypeCaller is designed to call germline variants, while Mutect2 is designed to call somatic variants. Neither is appropriate for the other use case.
Germline variants are straightforward. They vary against the reference. Germline calling typically assumes a fixed ploidy and calling includes genotyping sites. HaplotypeCaller allows setting a ploidy other than diploid with the -ploidy argument. HaplotypeCaller can call germline variants on one or multiple samples and the tool can use evidence of variation across the samples to increase confidence in a variant call. For this discussion, it is noteworthy that HaplotypeCaller does not necessarily rely on a balance in the alleles in genotyping, e.g. it can call what may be considered a low allele fraction alternate allele as part of a heterozygous genotype. Furthermore, if the number of alleles at a site surpasses the ploidy assumption, then HaplotypeCaller's reference confidence mode (-ERC GVCF) may detect and call these alleles and their respective AD allele depths, even if the GT genotype call uses only a subset of the alleles to fit the ploidy assumption.
Somatic variants contrast between two samples against the reference. What do we mean by somatic? The Greek word soma refers to parts of an organism other than the reproductive cells. For example, our skin cells are soma-tic and accumulate mutations from sun exposure that presumably our seed or germ cells are protected from. In this example, variants in skin cells that are not variant in the blood cells are somatic.
Mutect2 works primarily by contrasting the presence or absence of evidence for variation between two samples, the tumor and matched normal, from the same individual. The tool can run on unmatched tumors but this produces high rates of false positives. Technically speaking, somatic variants are both (i) different from the control sample and (ii) different from the reference. What this means is that if a site is variant in the control but in the somatic sample reverts to the reference allele, then it is not a somatic variant.
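The two-part definition above can be expressed as a small predicate. This is only an illustrative sketch in Python. The function name and the representation of alleles as plain strings are assumptions made for the example, and they do not reflect how Mutect2 evaluates read evidence internally.

```python
def is_somatic_candidate(ref_allele, control_alleles, case_allele):
    """Return True if case_allele differs from both the reference and the control.

    Hypothetical helper for illustration only; Mutect2 works on read-level
    likelihoods, not on hard allele calls like these.
    """
    differs_from_reference = case_allele != ref_allele
    differs_from_control = case_allele not in control_alleles
    return differs_from_reference and differs_from_control


# Case carries a new allele absent from the control: somatic candidate.
print(is_somatic_candidate("A", {"A"}, "T"))       # True
# Control is variant (A/G) and the case shows only the reference allele:
# a reversion to reference is not a somatic variant.
print(is_somatic_candidate("A", {"A", "G"}, "A"))  # False
# Case allele is also present in the control: germline, not somatic.
print(is_somatic_candidate("A", {"A", "G"}, "G"))  # False
```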
Technical points that highlight differences between Mutect2 and HaplotypeCaller
- Mutect2 is incapable of calculating reference confidence, which is a feature in HaplotypeCaller that is key to producing GVCFs. As a result, there is currently no way to perform joint calling for somatic variant discovery.
- Because a somatic callset is based on a single individual rather than a cohort, annotations in the INFO column of a Mutect2 VCF only refer to the ALT alleles and do not include values for the REF allele. This differs from a germline cohort callset, in which annotations in the INFO field are typically derived from data related to all observed alleles including the reference.
- While HaplotypeCaller relies on a fixed ploidy assumption to calculate the genotype likelihoods that are the basis for genotype probabilities (PL), Mutect2 allows for varying ploidy in the form of allele fractions for each variant. Varying allele fractions are often seen within a tumor sample due to fractional purity, multiple subclones and copy number variation.
- Mutect2 also differs from HaplotypeCaller in that it can apply various prefilters to sites and alleles depending on the use of a matched normal, a panel of normals (PoN) and a common population variant resource containing allele-specific frequencies. If a PoN or matched normal is provided, Mutect2 can use either to filter sites before reassembly, and it can use a germline resource to filter alleles.
- The variant site annotations that HaplotypeCaller and Mutect2 apply by default are very different; see their respective tool documentation for details.
- Finally, Mutect2 has additional parameters not available to HaplotypeCaller. These parameters factor towards the decision to perform reassembly on a region, towards whether to emit a variant and towards whether to filter a site:
- For one, the frequency of alleles not in the germline resource (--af-of-alleles-not-in-resource) defines the germline variant prior, which Mutect2 uses in likelihood calculations of a variant being germline.
- Second, the log somatic prior (--log-somatic-prior) defines the somatic variant prior, which Mutect2 uses in likelihood calculations of a variant being somatic.
- Third, the normal log odds ratio (--normal-lod) defines the filter threshold for variants in the tumor not being in the normal, i.e. the germline risk factor.
- Fourth, the tumor log odds ratio for emission (--tumor-lod-to-emit) defines the cutoff for a tumor variant to appear in a callset.
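To make the ploidy point in the list above concrete, the following Python sketch contrasts a fixed-ploidy genotype call with a plain allele fraction estimate. It assumes a deliberately simplified binomial read-count model with a hypothetical per-read error rate of 0.01. Neither HaplotypeCaller nor Mutect2 uses this exact model, and all function names are invented for the example.

```python
import math

ERROR_RATE = 0.01  # assumed per-read error rate, for illustration only

# Expected alt-read fraction under each diploid genotype (simplified model).
DIPLOID_ALT_FRACTION = {
    "0/0": ERROR_RATE,
    "0/1": 0.5,
    "1/1": 1.0 - ERROR_RATE,
}


def diploid_log_likelihoods(ref_reads, alt_reads):
    """Binomial log-likelihood of the read counts under each diploid genotype.

    The binomial coefficient is omitted because it is identical for every
    genotype and does not affect the comparison.
    """
    return {
        gt: alt_reads * math.log(f) + ref_reads * math.log(1.0 - f)
        for gt, f in DIPLOID_ALT_FRACTION.items()
    }


def best_diploid_genotype(ref_reads, alt_reads):
    """Genotype with the highest likelihood under the fixed-ploidy model."""
    likelihoods = diploid_log_likelihoods(ref_reads, alt_reads)
    return max(likelihoods, key=likelihoods.get)


def allele_fraction(ref_reads, alt_reads):
    """Observed alt allele fraction, with no ploidy assumption."""
    return alt_reads / (ref_reads + alt_reads)


for ref_reads, alt_reads in [(50, 50), (80, 20), (97, 3)]:
    print(ref_reads, alt_reads,
          best_diploid_genotype(ref_reads, alt_reads),
          allele_fraction(ref_reads, alt_reads))
```

Under these assumptions, a site with 3 alt reads out of 100 is forced into the hom-ref genotype by the diploid model, whereas the allele fraction view keeps it as a 3% allele. That is the kind of signal a subclonal or low-purity tumor mutation produces.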
Historical perspective explains some quirks of somatic calling
Somatic calling is NOT a simple subtraction of control variant alleles from case sample variant alleles. The reason for this stems from the original intent for somatic callsets in cancer research.
- First and foremost, protect patient privacy. Germline variants, in particular those in untranslated regions or noncoding regions of the genome, can identify individuals. The evolutionary constraints that mutations in coding regions are subject to, e.g. selection against detrimental frameshift mutations, do not apply to noncoding regions, so coding and noncoding regions differ in the extent to which their mutations can identify individuals. To protect patient identities, somatic calling was designed to avoid passing on any identifying germline variation from untranslated and noncoding regions. Somatic mutations in coding regions do not identify individuals and are publicly shareable according to TCGA policies.
- Maximize specificity. Somatic calling in cancer research was intended to generate data for use in computational analyses. These analyses focus on triangulating cancer driver genes in cancer cohorts. Because of the number of samples within a cohort, an analysis can tolerate loss of signal from individual samples. However, by the same token, an analysis can also pick up recurrent artifacts of sequencing technology. What this means is that researchers prefer to remove the maximal number of false positives even at the expense of losing some true positives.
Somatic callers reflect these two preferences in their stringent filtering, either upfront such that a variant call is not emitted or downstream such that a site is annotated in the FILTER column with the filter name.
A somatic caller should detect low fraction alleles, should make no explicit ploidy assumption and should omit genotyping in the traditional sense. Mutect2 adheres to all of these criteria. A number of cancer sample characteristics necessitate such caller features. For one, biopsied tumor samples are commonly contaminated with normal cells, and the normal fraction can be much higher than the tumor fraction of a sample. Second, a tumor can be heterogeneous in its mutations. Third, these mutations not uncommonly include aneuploid events that change the copy number of a cell's genome in patchwork fashion.
A variant allele in the case sample is not called if the site is variant in controls. We explain an exception for GATK4 Mutect2 in a bit.
Historically, somatic callers have called somatic variants at the site level. That is, if a variant site in the case is also variant in the matched control or in a population resource, e.g. dbSNP, then it is discounted from the somatic callset, even if the variant allele differs from that in the control or resource. This practice stems in part from cancer study designs where the control normal sample is sequenced at much lower depth than the case tumor sample. Because of the assumption that mutations strike randomly, cancer geneticists view mutations at sites of common germline variation with skepticism. Remember that for humans, common germline variant sites occur on average roughly once per thousand reference bases. So if a commonly variant site accrues additional mutations, we must weigh the chance of it having arisen from a true somatic event against it being something else that will likely not add value to downstream analyses. For most sites and typical analyses, the latter is the case. The variant is unlikely to have arisen from a somatic event and more likely to be a germline variant or an artifact, e.g. from mapping or cross-sample contamination.
GATK4 Mutect2 still applies this practice in part. The tool discounts variant sites shared with the panel of normals or with sites that are unambiguously variant in the matched normal control. If the matched normal's variant allele is supported by few reads, at low allele fraction, then the tool accounts for the possibility of the site not being a germline variant.
When it comes to the population germline resource, GATK4 Mutect2 distinguishes between the variant alleles in the germline resource and the case sample. That is, Mutect2 will call a variant site somatic if the allele differs from that in the germline resource. Blog#10911 explains this in a bit more detail and explains how Mutect2 factors germline variant allele frequencies in calling.
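The difference between the historical site-level practice and the allele-aware handling of a germline resource can be sketched in Python as follows. The resource layout, the function names and the hard set-membership test are all assumptions for illustration. Per the text above, Mutect2 additionally factors germline allele frequencies into its calculations, which this sketch does not attempt to model.

```python
def passes_site_level(site, alt, resource):
    """Historical practice: discount any case variant at a known variant site."""
    return site not in resource


def passes_allele_level(site, alt, resource):
    """Allele-aware practice: discount only if the same alt allele is known."""
    return alt not in resource.get(site, set())


# Hypothetical germline resource: (contig, position) -> known alt alleles.
germline_resource = {("chr1", 1000): {"G"}}

case_variants = [
    (("chr1", 1000), "G"),  # same allele as the resource
    (("chr1", 1000), "T"),  # same site, different allele
    (("chr1", 2000), "C"),  # site absent from the resource
]

for site, alt in case_variants:
    print(site, alt,
          passes_site_level(site, alt, germline_resource),
          passes_allele_level(site, alt, germline_resource))
```

Only the second variant is treated differently by the two approaches: it is discarded at the site level but retained when alleles are compared.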
Somatic workflows filter case sites with multiple variant alleles. By a similar logic to that outlined above, and with the assumption that common variant sites are biallelic, any site that presents multiple variant alleles in the case sample is suspect. Mutect2 still calls such sites and the contrasting variant alleles; however, in the next step of the workflow, FilterMutectCalls filters such sites with the multiallelic filter. It is possible a multiallelic site in the case sample represents a somatic event, but it is more likely the site is a germline variant site or an artifactual site.
The panel of normals helps filter systematic artifacts of sequencing. Artifacts are seeming variants in the read data that are in fact false positives. Sequencing technology's artifacts are not all random. Some artifacts come from sample preparation and present in specific sequence contexts. Other artifacts come from mapping. These artifacts often appear like low allele fraction somatic mutations. When somatic callsets are gathered in a cohort, these artifacts can present a strong signal, as they occur systematically in some fraction of samples. To remove such false signals, Mutect2 filters sites present in a given panel of normals (specified with -pon). Typically, a PoN is constructed with germline normal samples. First, calls are made using the same sensitivity as that used in somatic calling, i.e. with Mutect2. Second, the multiple normal samples are gathered into a cohort. Finally, the panel retains sites called in two or more samples. GATK4's CreateSomaticPanelOfNormals performs these latter two steps. Use of a PoN constructed from germline normals has the added benefit of filtering common germline variant sites. This is especially useful for somatic analysis of species that lack a common germline variant resource.
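The retention rule described above, keeping sites called in two or more normal samples, can be sketched in Python. This is a conceptual illustration only and not the implementation of CreateSomaticPanelOfNormals. The function name, the `min_samples` parameter and the representation of sites as (contig, position) tuples are assumptions for the example.

```python
from collections import Counter


def build_panel_sites(per_sample_sites, min_samples=2):
    """Return the sites called in at least min_samples of the normal samples."""
    counts = Counter()
    for sites in per_sample_sites:
        counts.update(set(sites))  # count each site at most once per sample
    return {site for site, n in counts.items() if n >= min_samples}


# Hypothetical per-sample call sites from three normals.
normal_calls = [
    [("chr1", 100), ("chr1", 250)],
    [("chr1", 100), ("chr2", 75)],
    [("chr1", 100), ("chr1", 250), ("chr3", 9)],
]

panel = build_panel_sites(normal_calls)
print(sorted(panel))  # [('chr1', 100), ('chr1', 250)]
```

Sites seen in only one normal, here chr2:75 and chr3:9, are left out of the panel, so a one-off call in a single normal does not cause filtering of case calls at that site.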
- Tutorial#2801 outlines how to call germline short variants with HaplotypeCaller.
- Tutorial#11136 outlines the GATK4 somatic short variant discovery workflow.
- For differences between GATK4 Mutect2 and GATK3 MuTect2, see Blog#10911.
- HaplotypeCaller tool documentation is here.
- GATK4 Mutect2 tool documentation is here.