Normallod and tumorlod in Mutect2
Hello,
What are the normallod and tumorlod in Mutect2?
What are the bases of the default values 2.2 and 3.0 respectively?
And if the thresholds are lowered or raised, what would happen in terms of mutation calling?
Please explain them using very basic terms (AD, AF, ...).
Many thanks,
Luke
Best Answers

davidben Boston ✭✭✭
Whoops, I dropped the ball. The lods are log10 likelihood ratios, i.e. a normal lod of 4 means the reads support a hom ref hypothesis for the normal by a factor of 10^4. The default thresholds are just an empirical compromise between speed, sensitivity, and precision. We are working on a better model that lets the user choose a maximum acceptable false discovery rate.

davidben Boston ✭✭✭
@manba These questions are about probability and statistics and not really specific to the GATK. I am very fond of Larry Wasserman's book "All of Statistics," which is more introductory than the daunting title would suggest. Others in the group have had good experiences with the MIT OpenCourseWare introduction to probability and statistics.
We make our documentation as self-contained as possible, but I hope you will understand that we need to assume some mathematical prerequisites.
Answers
@lukeheo
Hi Luke,
I am checking with the developer and will get back to you.
Sheila
The likelihood of a hypothesis is defined as the probability of the observed data (reads) given that the hypothesis is true. It's different from the probability of the hypothesis, because a hypothesis can fit the data well but be a priori improbable. For example, if it's dark outside, the likelihood of the hypothesis that sun-blocking aliens came to Earth is quite high (assuming that these aliens are always in the habit of blocking the sun) because it explains the data well, despite being outlandish.
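To make the likelihood-versus-probability distinction concrete, here is a minimal sketch using Bayes' rule with two competing hypotheses. All of the numbers (the prior for aliens, the chance that it is dark without aliens) are made up purely for illustration, and the `posterior` helper is a hypothetical name, not anything from GATK.

```python
def posterior(prior_h, lik_h, lik_alt):
    """P(H | data) for two exhaustive hypotheses, via Bayes' rule."""
    prior_alt = 1.0 - prior_h
    evidence = prior_h * lik_h + prior_alt * lik_alt
    return prior_h * lik_h / evidence

# Hypothetical numbers, chosen only for illustration.
lik_aliens = 1.0       # P(dark | aliens): the aliens always block the sun
lik_no_aliens = 0.5    # P(dark | no aliens): assume it is night half the time
prior_aliens = 1e-12   # made-up prior for the outlandish hypothesis

# The likelihood of "aliens" is as high as it can be, yet the posterior
# probability stays tiny because the prior is tiny.
print("likelihood of aliens:", lik_aliens)
print("posterior of aliens: ", posterior(prior_aliens, lik_aliens, lik_no_aliens))
```

The lods discussed in this thread compare likelihoods only; no prior enters the ratio.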
Anyway, the NLOD is the log10 of the following likelihood ratio (here "P" means probability and "|" means "given that"):
P(reads | normal is hom ref, i.e. has no mutation) / P(reads | normal is het, i.e. has the mutation)
You could approximately think of these likelihoods as coming from a binomial model, where if we have k alt reads out of n total reads and have a base error rate of e, then
P(reads | hom ref) = Binomial(k | n, e) (alt reads are due to error)
and
P(reads | het) = Binomial(k | n, 1/2) (alt reads are real and diploid)
The results of Mutect2 are fairly insensitive to the threshold because, given a modest depth of coverage in the normal, the NLOD is usually overwhelming one way or the other, since diploid het calling is generally easy.
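The binomial approximation above can be sketched in a few lines of Python. This is only the back-of-the-envelope model from this post, not Mutect2's actual likelihood code; the function names and the error rate of 0.001 are assumptions made for the example.

```python
import math

def log10_binomial_pmf(k, n, p):
    """log10 of Binomial(k | n, p), computed in log space to avoid underflow."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_pmf = log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)
    return log_pmf / math.log(10)

def approx_normal_lod(k, n, error_rate=0.001):
    """log10[ P(reads | hom ref) / P(reads | het) ] under the binomial sketch.

    k: alt reads in the normal, n: total reads in the normal.
    error_rate is an assumed per-base error rate, not a GATK default.
    """
    hom_ref = log10_binomial_pmf(k, n, error_rate)  # alt reads are errors
    het = log10_binomial_pmf(k, n, 0.5)             # alt reads are real, diploid
    return hom_ref - het

# A few (alt reads, depth) combinations in the normal sample.
for k, n in [(0, 30), (1, 30), (15, 30)]:
    print(f"alt={k:2d} depth={n}  approx NLOD = {approx_normal_lod(k, n):7.2f}")
```

With 30 reads in the normal, zero or one alt read gives a lod well above the 2.2 threshold mentioned in the question, while 15 alt reads gives a lod far below it. That is the sense in which the result is usually overwhelming one way or the other, so moving the threshold by a small amount rarely changes the call.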
I think we would much appreciate a detailed document explaining the thresholds behind the FILTER column of the GATK4 Mutect2 VCF. For example, when a variant was flagged as "cluster_event", I searched the forum for "cluster_event" and found threads such as https://gatkforums.broadinstitute.org/gatk/discussion/9985/gatk4betaclusteredeventsinmutect2filtermutectcalls, and I have also read the mathematical notes on Mutect PDF, but I still could not get a definite answer as to when it is flagged.
@xiucz The answer to this and many other questions can be found in the docs we maintain in the GATK repo on GitHub: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf. In particular, you may find Section 8: Mutect Filters and the table therein to be helpful.
For example, in P(reads | normal is hom ref), the reads are the data and "normal is hom ref" is the hypothesis.
Hi @manba, here is an opportunity to learn about those icky statistics!!! There is a website found here that provides a basic rundown of how the likelihood and probabilities are generated when analyzing the data.
When reading the results, it is best not to think in concrete numbers of things, but in terms of the probability space defined by your observations. For example, if I tell you that the chance of hitting a winning lottery ticket is 1 in 1 million, it does not mean that I found exactly 1 winning ticket and 999,999 losing tickets. It just means that across all lotteries of this type, on average, one would expect about one winning ticket per million, whereas in any particular million there could be 5 or 10 or 50 or none. So, in this case, a factor of 10^4 does not mean there are 10,000 actual reads that support or do not support the hypothesis.
Check out the tutorials at that link; I think you may find them helpful for using GATK.