Likelihoods and Probabilities
There are several instances in the GATK documentation where you will encounter the terms "likelihood" and "probability", because key tools in the variant discovery workflows rely heavily on Bayesian statistics. For example, the HaplotypeCaller, our most prominent germline SNP and indel caller, uses Bayesian statistics to determine genotypes.
So what do likelihood and probability mean and how are they related to each other in the Bayesian context?
Bayes' rule states that
    P(H|D) = P(H) P(D|H) / P(D)
where the bit we care about most, P(D|H), is the probability of observing D given the hypothesis H. This can also be formulated as L(H|D), i.e. the likelihood of the hypothesis H given the observation D:
    P(D|H) = L(H|D)
We use the term likelihood instead of probability to describe the term on the right because we cannot calculate a meaningful probability distribution on a hypothesis, which by definition is binary (it will either be true or false) -- but we can determine the likelihood that a hypothesis is true or false given a set of observations. For a more detailed explanation of these concepts, please see this lesson.
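To make this concrete, here is a minimal Python sketch of likelihoods for three competing genotype hypotheses at a single site. It assumes a toy binomial model of read counts with a fixed per-read error rate. This is an illustration only, not the model HaplotypeCaller uses, and the function name, the error rate, and the read counts are all hypothetical.

```python
from math import comb

def genotype_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Return L(G|D) = P(D|G) for genotypes 0/0, 0/1 and 1/1 (toy binomial model)."""
    n = n_ref + n_alt
    # Assumed probability that a single read shows the alt allele under each genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    return {
        g: comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
        for g, p in p_alt.items()
    }

# Hypothetical observation D: 6 reads support the reference allele, 4 the alt allele.
likelihoods = genotype_likelihoods(n_ref=6, n_alt=4)
total = sum(likelihoods.values())

for genotype, value in likelihoods.items():
    print(genotype, value)
print("sum over hypotheses:", total)
```

Each value is a proper probability of the data under one hypothesis, but the three values do not sum to 1 across hypotheses. Read as a function of H, L(H|D) ranks the hypotheses without being a probability distribution over them.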
Now you may wonder, what about the posterior probability P(H|D) that we eventually calculate through Bayes' rule? Isn't that a "probability of a hypothesis"? Well yes; in Bayesian statistics, we can calculate a posterior probability distribution on a hypothesis, because its probability distribution is relative to all of the other competing hypotheses (http://www.smbc-comics.com/index.php?id=4127). Tadaa.
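The normalization over competing hypotheses can be sketched in a few lines of Python. The likelihood and prior values below are made-up numbers chosen for illustration, and `posterior` is a hypothetical helper, not a GATK function.

```python
def posterior(likelihoods, priors):
    """Apply Bayes' rule over a finite set of competing hypotheses."""
    # Numerator of Bayes' rule for each hypothesis: P(H) * L(H|D).
    unnormalized = {h: priors[h] * likelihoods[h] for h in likelihoods}
    # P(D) is the sum of those numerators over all competing hypotheses.
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}

# Hypothetical genotype likelihoods L(G|D) and priors P(G), for illustration only.
likelihoods = {"0/0": 2e-6, "0/1": 0.2, "1/1": 2e-10}
priors = {"0/0": 0.998, "0/1": 0.001, "1/1": 0.001}

post = posterior(likelihoods, priors)
for genotype, value in post.items():
    print(genotype, value)
```

Dividing by P(D), which here is just the sum of the numerators, is what turns the likelihoods into a distribution: the posteriors sum to 1 across the hypotheses even though the likelihoods that went in did not.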
See this HaplotypeCaller doc article for a worked-out explanation of how we calculate and use genotype likelihoods in germline variant calling.
So always remember this, if nothing else: the terms likelihood and probability are not interchangeable in the Bayesian context, even though they are often used interchangeably in common English.