Likelihoods and Probabilities
There are several instances in the GATK documentation where you will encounter the terms "likelihood" and "probability", because key tools in the variant discovery workflows rely heavily on Bayesian statistics. For example, the HaplotypeCaller, our most prominent germline SNP and indel caller, uses Bayesian statistics to determine genotypes.
So what do likelihood and probability mean and how are they related to each other in the Bayesian context?
Bayes' rule states that

$$P(H|D) = \frac{P(H)\,P(D|H)}{P(D)}$$
where the bit we care about most, P(D|H), is the probability of observing D given the hypothesis H. This can also be formulated as L(H|D), i.e. the likelihood of the hypothesis H given the observation D:

$$P(H|D) = \frac{P(H)\,L(H|D)}{P(D)}$$
We use the term likelihood instead of probability to describe the term on the right because we cannot calculate a meaningful probability distribution on a hypothesis, which by definition is binary (it will either be true or false). We can, however, determine the likelihood that a hypothesis is true or false given a set of observations. For a more detailed explanation of these concepts, please see this lesson.
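To make the distinction concrete, here is a minimal Python sketch. It assumes a toy binomial read-count model with made-up alt-read fractions for three genotype hypotheses; it is not the model HaplotypeCaller uses. It illustrates that P(D|H) behaves like a probability over possible observations when H is fixed, whereas the same quantity read as L(H|D) with D fixed does not have to sum to 1 across hypotheses.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of k alt-supporting reads out of n when each read is alt with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical alt-read fractions expected under each genotype hypothesis.
# Illustrative values only (they fold in a small error rate); not GATK's model.
ALT_FRACTION = {"hom_ref": 0.01, "het": 0.5, "hom_var": 0.99}


def genotype_likelihoods(k, n):
    """L(H|D) = P(D|H) for each hypothesis H, with the data D = (k, n) held fixed."""
    return {h: binomial_pmf(k, n, p) for h, p in ALT_FRACTION.items()}


# Observation D: 4 alt-supporting reads out of 10.
observed = genotype_likelihoods(k=4, n=10)

# Fixed hypothesis, summed over every possible observation: a proper distribution.
total_over_data = sum(binomial_pmf(k, 10, ALT_FRACTION["het"]) for k in range(11))

# Fixed observation, summed over hypotheses: not constrained to equal 1.
total_over_hypotheses = sum(observed.values())

best_hypothesis = max(observed, key=observed.get)
```

In this toy setup the heterozygous hypothesis has by far the highest likelihood, yet the three likelihoods together add up to well under 1, which is why they cannot be read as a probability distribution over hypotheses.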
Now you may wonder: what about the posterior probability P(H|D) that we eventually calculate through Bayes' rule? Isn't that a "probability of a hypothesis"? Well, yes; in Bayesian statistics we can calculate a posterior probability distribution on a hypothesis, because that distribution is defined relative to all of the other competing hypotheses (see http://www.smbc-comics.com/index.php?id=4127). Tadaa.
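The role of the competing hypotheses can be sketched in a few lines of Python. The priors and likelihoods below are invented for illustration, and the function name `posteriors` is hypothetical rather than part of any GATK API. The point is only that P(D) is obtained by summing prior times likelihood over all hypotheses, which is what turns the likelihoods into a proper posterior distribution.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' rule across a set of competing hypotheses.

    priors:      dict mapping hypothesis -> P(H)
    likelihoods: dict mapping hypothesis -> L(H|D) = P(D|H)
    """
    # Numerator of Bayes' rule for each hypothesis: P(H) * L(H|D).
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    # P(D): the sum of the numerators over every competing hypothesis.
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}


# Made-up numbers for three genotype hypotheses (assumptions, not GATK defaults).
example_priors = {"hom_ref": 0.998, "het": 0.001, "hom_var": 0.001}
example_likelihoods = {"hom_ref": 1e-6, "het": 0.2, "hom_var": 1e-12}

post = posteriors(example_priors, example_likelihoods)
most_probable = max(post, key=post.get)
```

Even though the prior strongly favors the homozygous-reference hypothesis here, the heterozygous hypothesis ends up with nearly all of the posterior mass, and the posteriors sum to 1 because each one is expressed relative to the others.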
See this HaplotypeCaller doc article for a worked-out explanation of how we calculate and use genotype likelihoods in germline variant calling.
So always remember this, if nothing else: the terms likelihood and probability are not interchangeable in the Bayesian context, even though they are often used interchangeably in common English.