Assigning per-sample genotypes (HaplotypeCaller)

This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as the one on PL and GQ.

This procedure is NOT applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.


Contents

  1. Overview
  2. Preliminary assumptions / limitations
  3. Calculating genotype likelihoods using Bayes' Theorem
  4. Selecting a genotype and emitting the call record

1. Overview

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.

Note that this describes the regular mode of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in GVCF modes (-ERC GVCF and -ERC BP_RESOLUTION) please see the reference confidence model documentation.


2. Preliminary assumptions / limitations

Quality

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

Ploidy

Both the HaplotypeCaller and GenotypeGVCFs assume that the organism of study is diploid by default, but the desired ploidy can be set using the -ploidy argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation using a generalized form of the genotyping algorithm that can handle ploidies other than 2. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered. There are some arguments that aim to mitigate those limitations but they are not fully documented yet.

Paired end reads

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

Single-sample vs multi-sample

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.


3. Calculating genotype likelihoods using Bayes' Theorem

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype G given the observed data D.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype G:

$$ P(G) $$

represents how probably we expect to see this genotype based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value) but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype G is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype G into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype G.


4. Selecting a genotype and emitting the call record

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.

Post edited by Sheila on
Sign In or Register to comment.