# Understanding the math behind MUTECT LOD scores

Jenkinson
Member

Hello, thank you for making such great tools and for having such a responsive forum for answering questions about your methods. My question is: what exactly are the LOD scores, and how are they computed (and thus how should they be interpreted) in Mutect/Mutect2? My background is more mathematical/statistical, so I am trying to follow the mathematics of the paper in full detail, which is the only way I feel I will fully understand any given method.

Unfortunately, it appears that the Nature Biotechnology paper where the details are spelled out may have typos/errors in the "Methods: Variant Detection" section, which make it difficult to track how LOD_T and LOD_N are defined/used in MUTECT. In particular, if you look at the two LOD_T equations, there are two conflicting definitions/equalities for LOD_T(m,f). These two equations would only agree if P(m,f)=0.5, which is not an assumption I believe the package is making. Likewise, the LOD_N inequalities further down in "Methods: Variant Classification" are not consistent; these inequalities would only be equivalent if P(m,f)=P(germ line), which is again an assumption that I do not believe the method is making.

Can someone clarify, in the notation of that paper, what these equations should be to align with the actual implementation in Mutect? And is it safe to say that the same interpretation/logic carries over into Mutect2's TLOD score?

I did search the forums and see the other post here:

(I cannot post links because I made a new account, but there is a forum post whose link ends in /gatk/discussion/4463/how-mutect-identifies-candidate-somatic-mutations)

But the notation there is somewhat ambiguous and does not match the more rigorous notation of the Mutect paper, making it difficult for me to track exactly what it means. I also see what appear to be errors in that post, for example:

$$\mathrm{LOD}_T > \log_{10}(0.5 \times 10^{-6}) \approx 6.3$$

But I believe that log_{10} (0.5\times 10^{-6}) is approximately equal to negative 6.3, not positive 6.3 as written. The same positive 6.3 threshold also appears in the Mutect methods section.
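The arithmetic can be checked directly. The short sketch below is only an illustration: it evaluates the expression as quoted and, as an assumption that neither the paper nor the forum post confirms, also evaluates the logarithm of the reciprocal, which is one way a value of +6.3 could arise. The variable names are hypothetical.

```python
import math

# Expression exactly as quoted from the forum post.
quoted_value = math.log10(0.5e-6)

# ASSUMPTION (not confirmed by the source): the intended quantity may have
# been the log of the reciprocal, i.e. log10(1 / (0.5e-6)) = log10(2e6).
reciprocal_value = math.log10(1 / 0.5e-6)

print(f"log10(0.5e-6)     = {quoted_value:.3f}")
print(f"log10(1 / 0.5e-6) = {reciprocal_value:.3f}")
```

The quoted expression evaluates to roughly -6.3, while the reciprocal form gives roughly +6.3. Whether the reciprocal is what Mutect actually implements is exactly the open question here.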

I would greatly appreciate any clarity that can be provided regarding the details of the math behind these LOD scores as they are actually used/implemented in Mutect.

Thank you very much!

Garrett

## Answers

@Jenkinson For Mutect2, please refer to the Somatic Likelihoods Model section of our notes: https://github.com/broadinstitute/gatk/blob/2ef54c69b1523380f3780b98d8a971d3652f86bc/docs/mutect/mutect.pdf

Mutect1's model is completely different and much simpler. Unfortunately, while we know it in outline, we are not familiar enough with its details to answer your questions in a reasonable amount of time.