# HC step 4: Assigning per-sample genotypes

This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as PL and GQ.

Note that this describes the **regular mode** of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in `-ERC` modes (`GVCF` and `BP_RESOLUTION`), please see the reference confidence model documentation.

### Overview

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. All that remains is to evaluate those likelihoods in aggregate to determine the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihood of each possible genotype and selecting the most likely one. This produces a genotype call along with various metrics that will be annotated in the output VCF if a variant call is emitted.

### 1. Preliminary assumptions / limitations

#### Quality

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

#### Ploidy

Both the HaplotypeCaller and GenotypeGVCFs (but not UnifiedGenotyper) assume that the organism of study is diploid by default, but the desired ploidy can be set using the `-ploidy` argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation. The generalized form of the genotyping algorithm that can handle ploidies other than 2 is available as of version 3.3-0. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered (currently, the maximum "workable" ploidy is ~20 for a max number of alt alleles = 6). Future developments will aim to mitigate those limitations.
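
To get a feel for why high ploidy combined with many alternate alleles becomes impractical, here is a minimal sketch that counts the unordered genotypes at a site. It assumes a genotype is simply a multiset of alleles of size equal to the ploidy, which is a standard combinatorial counting argument rather than anything taken from the GATK source code; the function name `genotype_count` is hypothetical.

```python
from math import comb

def genotype_count(ploidy: int, n_alleles: int) -> int:
    """Count unordered genotypes: multisets of size `ploidy` drawn from
    `n_alleles` distinct alleles (reference + alternates)."""
    return comb(ploidy + n_alleles - 1, ploidy)

# Diploid, one alt allele (2 alleles total): AA, AB, BB
print(genotype_count(2, 2))    # 3

# Ploidy 20 with 6 alt alleles plus the reference (7 alleles total)
print(genotype_count(20, 7))   # 230230
```

Each of those genotypes needs its own likelihood at every site, so the cost grows quickly with both ploidy and allele count.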

#### Paired end reads

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

#### Single-sample vs multi-sample

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.

### 2. Calculating genotype likelihoods using Bayes' Theorem

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype **G** given the observed data **D**.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype **G**:

$$ P(G) $$

represents how probable we expect this genotype to be based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value) but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.
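
As a worked consequence of using a flat prior, suppose every one of the N candidate genotypes gets the same prior value 1/N. The specific value 1/N is an illustrative assumption here, not a documented GATK default. The prior then cancels out of the formula:

$$ P(G|D) = \frac{ \frac{1}{N} P(D|G) }{ \sum_{i} \frac{1}{N} P(D|G_i) } = \frac{ P(D|G) }{ \sum_{i} P(D|G_i) } $$

Under that assumption, the genotype with the largest posterior is simply the one with the largest conditional probability of the data.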

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype **G** is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod_{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype **G** into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. For other ploidies, the generalized form of the genotyping algorithm mentioned in the Ploidy section above applies; it is not shown here.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype **G**.
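
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not GATK's implementation: the per-read allele likelihoods are made-up numbers standing in for the Step 3 table, the prior is assumed flat, and the function names are hypothetical. The work is done in log10 space to avoid numerical underflow when many reads are multiplied together.

```python
import math

# Hypothetical per-read allele likelihoods P(D_j | allele), one dict per read.
read_likelihoods = [
    {"A": 0.99, "T": 0.01},
    {"A": 0.98, "T": 0.02},
    {"A": 0.01, "T": 0.99},
    {"A": 0.02, "T": 0.98},
]

def log10_genotype_likelihood(reads, h1, h2):
    """log10 P(D|G) for diploid G = h1/h2: product over reads of the
    average of the two allele likelihoods, computed as a sum of logs."""
    return sum(math.log10(r[h1] / 2 + r[h2] / 2) for r in reads)

def genotype_posteriors(reads, genotypes, priors=None):
    """P(G|D) for each genotype via Bayes' theorem (flat prior by default)."""
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    log_joint = {
        g: math.log10(priors[g]) + log10_genotype_likelihood(reads, *g)
        for g in genotypes
    }
    # Subtract the max before exponentiating for numerical stability.
    max_log = max(log_joint.values())
    unnorm = {g: 10 ** (v - max_log) for g, v in log_joint.items()}
    total = sum(unnorm.values())
    return {g: u / total for g, u in unnorm.items()}

genotypes = [("A", "A"), ("A", "T"), ("T", "T")]
post = genotype_posteriors(read_likelihoods, genotypes)
best = max(post, key=post.get)
print(best, post)
```

With two reads strongly supporting A and two strongly supporting T, the heterozygous genotype ends up with nearly all of the posterior probability.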

### 3. Selecting a genotype and emitting the call record

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.
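
As a rough sketch of that selection step, the following Python snippet enumerates the unordered allele combinations for a site and picks the most probable one. The probability values are invented for illustration, and `possible_genotypes` is a hypothetical helper, not a GATK function.

```python
from itertools import combinations_with_replacement

def possible_genotypes(alleles, ploidy=2):
    """Every unordered combination of `ploidy` alleles, e.g. AA, AT, TT."""
    return ["".join(g) for g in combinations_with_replacement(alleles, ploidy)]

genotypes = possible_genotypes(["A", "T"])
print(genotypes)  # ['AA', 'AT', 'TT']

# Hypothetical probabilities for each genotype, in the same order.
probabilities = dict(zip(genotypes, [0.006, 0.988, 0.006]))

# The call is the genotype with the largest probability.
call = max(probabilities, key=probabilities.get)
print(call)  # AT
```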

Note that depending on the variant calling options specified on the command line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ -- see the linked docs for details. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.
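
For orientation, here is a simplified sketch of how genotype likelihoods relate to PL and GQ. It assumes that PL is the Phred-scaled genotype likelihood normalized so the best genotype has PL 0, and that GQ is the second-lowest PL capped at 99. The PL and GQ documentation is the authoritative reference, and the input likelihoods and function names below are made up for illustration.

```python
import math

def phred_scaled_likelihoods(likelihoods):
    """Convert P(D|G) values to PLs: -10*log10(L), shifted so the minimum is 0."""
    raw = [-10 * math.log10(l) for l in likelihoods]
    best = min(raw)
    return [round(r - best) for r in raw]

def genotype_quality(pls, cap=99):
    """GQ as the gap between the best (PL 0) and second-best genotype, capped."""
    second_lowest = sorted(pls)[1]
    return min(second_lowest, cap)

# Hypothetical P(D|G) for genotypes AA, AT, TT
pls = phred_scaled_likelihoods([1.94e-4, 6.25e-2, 1.94e-4])
print(pls)                    # [25, 0, 25]
print(genotype_quality(pls))  # 25
```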

## Comments

Hi Sheila,

I have a question concerning the estimation of the genotypes. You said the GATK tools use a flat prior that is always the same value. What is the default value? 1/3, 1/3 and 1/3?

Thanks in advance!

@gonzalez

Hi,

Have a look at this thread

-Sheila

Hi,

I would like to ask how the data from other samples is used to support the genotyping of one sample in a multisample case (GenotypeGVCFs). I see very low AD values (relative to the other allele ADs and the expected frequency in a ploidy 4 case) for alleles that end up in very confidently assigned genotypes (high GQ) in a multisample case, and I'm wondering if this is because support is provided by other samples. I understand that AD includes even filtered reads, but I see this also in cases where the ADs add up to DP.

Thanks in advance!

@pjouhten

Hi,

Basically, the PL fields are used in determining genotypes. It is possible for samples that have low PLs/low GQ to be genotyped as variant (in the final VCF) because the other samples show evidence for a variant allele.

For example, if a diploid organism has no alternate allele in a GVCF and 0,10,20 PLs, the tool is not sure whether there is a variant there or not. That is why the NON-REF allele is so important. It serves as a single allele that captures all possible variation at the site and allows us to calculate PL values for possible variant genotypes at the site without actually calling a single variant allele. In the joint genotyping step, all the sample PLs are used and the prior is re-calculated for the site. The more you see a variant allele in a cohort, the higher the prior becomes. When the PLs are recalculated with the new priors (the NON-REF PLs are used for the other variant alleles seen in the cohort), the genotypes could change.

I hope that makes sense.

-Sheila

Hi,

I would really appreciate it if you could provide a worked example of this calculation.

I have searched for several hours for a worked example and cannot find any. I intend to use the PL for hard-filtering and I really need to understand the math behind it.

Maybe just a simple example for one diploid variant site. And could you also state the prior you use?

Hope you can help.

-Chris_M