# Likelihoods and Probabilities

The terms "likelihood" and "probability" appear throughout the GATK documentation because key tools in the variant discovery workflow rely heavily on Bayesian statistics. For example, the HaplotypeCaller, our most prominent germline SNP and indel caller, uses Bayesian statistics to determine genotypes.

#### So what do likelihood and probability mean and how are they related to each other in the Bayesian context?

In Bayesian statistics (as opposed to frequentist statistics), we are typically trying to evaluate the posterior probability of a hypothesis (H) based on a series of observations (data, D).

**Bayes' rule** states that

$${P(H|D)}=\frac{P(H)P(D|H)}{P(D)}$$

where the bit we care about most, **P(D|H)**, is the **probability of observing D given the hypothesis H**. This can also be formulated as **L(H|D)**, i.e. the **likelihood of the hypothesis H given the observation D**:

$$P(D|H)=L(H|D)$$

We use the term **likelihood** instead of **probability** to describe the term on the right because we cannot calculate a meaningful probability distribution on a hypothesis, which by definition is binary (it will either be true or false). What we *can* determine is the likelihood that a hypothesis is true or false given a set of observations. Formally, L(H|D) treats the observation D as fixed and the hypothesis H as the variable, so unlike a probability distribution, its values across the candidate hypotheses need not sum to 1. For a more detailed explanation of these concepts, please see the following lesson (http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading11.pdf).
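To make the distinction concrete, here is a minimal Python sketch of a toy model. It is an illustration only, not the model HaplotypeCaller actually uses. It assumes a single diploid site, reads that are independent, and a simple binomial model in which each genotype implies a fixed chance of a read carrying the alternate allele. The names `binomial_likelihood`, `genotype_likelihoods`, and `P_ALT_GIVEN_GENOTYPE`, as well as the 1% error rate, are hypothetical choices made for this example.

```python
from math import comb

# Assumed chance that a single read shows the alternate allele under each
# genotype hypothesis (toy values, with a hypothetical 1% error rate).
P_ALT_GIVEN_GENOTYPE = {"hom_ref": 0.01, "het": 0.5, "hom_alt": 0.99}


def binomial_likelihood(n_alt, n_reads, p_alt):
    """P(D|H): probability of seeing n_alt alternate reads out of n_reads."""
    return comb(n_reads, n_alt) * p_alt**n_alt * (1 - p_alt) ** (n_reads - n_alt)


def genotype_likelihoods(n_alt, n_reads):
    """L(H|D) for every genotype hypothesis H, with the data D held fixed."""
    return {
        genotype: binomial_likelihood(n_alt, n_reads, p_alt)
        for genotype, p_alt in P_ALT_GIVEN_GENOTYPE.items()
    }


# Observation D: 3 of 10 reads carry the alternate allele.
likelihoods = genotype_likelihoods(n_alt=3, n_reads=10)

# Summing over hypotheses (D fixed) does not give 1, so this is not a
# probability distribution over genotypes.
total_over_hypotheses = sum(likelihoods.values())

# Summing over all possible observations (H fixed) does give 1, because
# P(D|H) is a proper probability distribution over the data.
total_over_data = sum(binomial_likelihood(k, 10, 0.5) for k in range(11))

print(likelihoods)
print(total_over_hypotheses, total_over_data)
```

In this toy setting the heterozygous hypothesis has the largest likelihood, but the three likelihood values on their own say nothing about how probable each genotype is until they are combined with priors.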

Now you may wonder, what about the posterior probability P(H|D) that we eventually calculate through Bayes' rule? Isn't that a "probability of a hypothesis"? Well yes; in Bayesian statistics, we *can* calculate a *posterior* probability distribution on a hypothesis, because its probability distribution is *relative* to all of the other competing hypotheses (http://www.smbc-comics.com/index.php?id=4127). Tadaa.
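The normalization that makes the posterior a proper probability can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions: the function name `posterior`, the three genotype hypotheses, and all of the prior and likelihood values are made up for the example and are not taken from GATK. The only ingredients are Bayes' rule and the fact that P(D) is the sum of P(H)P(D|H) over all competing hypotheses.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule across a set of competing hypotheses.

    priors:      dict mapping hypothesis -> P(H)
    likelihoods: dict mapping hypothesis -> P(D|H) = L(H|D)
    Returns a dict mapping hypothesis -> P(H|D).
    """
    # Numerator of Bayes' rule for each hypothesis: P(H) * P(D|H).
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    # P(D): total probability of the data over all competing hypotheses.
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}


# Illustrative (made-up) inputs for three genotype hypotheses.
example_priors = {"hom_ref": 0.998, "het": 0.0015, "hom_alt": 0.0005}
example_likelihoods = {"hom_ref": 1e-4, "het": 0.12, "hom_alt": 1e-12}

posteriors = posterior(example_priors, example_likelihoods)
print(posteriors)
print(sum(posteriors.values()))
```

Because every value is divided by the same P(D), the posteriors sum to 1 across the hypotheses considered, which is exactly the sense in which the posterior is *relative* to the competing hypotheses.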

See the HaplotypeCaller documentation article for a worked-out explanation of how we calculate and use genotype likelihoods in germline variant calling.

So always remember this, if nothing else: the terms likelihood and probability are *not* interchangeable in the Bayesian context, even though they are often used interchangeably in common English.

A special thanks to Jon M. Bloom PhD (MIT) for his assistance in the preparation of this article.