# How MuTect identifies candidate somatic mutations

edited December 2015

Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.

### Overview

In a nutshell, the MuTect analysis consists of three steps:

1. Pre-processing the aligned reads in the tumor and normal sequencing data
2. Statistical analysis to identify sites that are likely to carry somatic mutations with high confidence
3. Post-processing of candidate somatic mutations

This document summarizes the key points of these three steps. For complete details, please see the 2013 publication in Nature Biotechnology:

Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnology (2013). doi:10.1038/nbt.2514

### 1. Pre-processing the aligned reads in the tumor and normal sequencing data

In this step we ignore reads with too many mismatches or very low quality scores, since such reads contribute more noise than signal.

### 2. Statistical analysis to identify sites that are likely to carry somatic mutations with high confidence

The statistical analysis predicts a somatic mutation by using two Bayesian classifiers. The first aims to detect whether the tumor is non-reference at a given site; for those sites that are found to be non-reference, the second checks that the normal does not carry the variant allele. In practice, each classification is performed by calculating a LOD score (log odds) and comparing it to a cutoff determined by the log ratio of the prior probabilities of the considered events.

For the tumors we calculate:

$$LOD_T = log_{10} \left ( \frac{ P( \text{observed data in tumor | site is mutated} ) } { P( \text{observed data in tumor | site is reference} ) } \right )$$

And for the normal:

$$LOD_N = log_{10} \left ( \frac{ P( \text{observed data in normal | site is reference} ) } { P( \text{observed data in normal | site is mutated} ) } \right )$$
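
As a rough illustration of how these two scores could be computed, the sketch below assumes a simple per-base error model along the lines of the one described in the publication: each observed base is either a true allele or a sequencing error with probability `e` (derived from its base quality), and the mutated model allows the alternate allele at fraction `f`. The function names, the choice of estimating `f` as the observed alternate-read fraction in the tumor, and the use of `f = 0.5` (a heterozygous germline variant) for the mutated model in the normal are assumptions made for this sketch; it is not MuTect's actual code.

```python
import math

def base_likelihood(base, error, ref, alt, f):
    """P(observed base | allele fraction f), assuming errors hit the 3 other bases equally."""
    if base == ref:
        return f * error / 3 + (1 - f) * (1 - error)
    if base == alt:
        return f * (1 - error) + (1 - f) * error / 3
    return error / 3

def log10_likelihood(bases, errors, ref, alt, f):
    """Log10 likelihood of a pileup, treating bases as independent."""
    return sum(
        math.log10(base_likelihood(b, e, ref, alt, f))
        for b, e in zip(bases, errors)
    )

def lod_tumor(bases, errors, ref, alt):
    """Mutated model (f = observed alt fraction) versus reference model (f = 0)."""
    f_hat = sum(b == alt for b in bases) / len(bases)
    return (log10_likelihood(bases, errors, ref, alt, f_hat)
            - log10_likelihood(bases, errors, ref, alt, 0.0))

def lod_normal(bases, errors, ref, alt):
    """Reference model (f = 0) versus heterozygous germline model (f = 0.5)."""
    return (log10_likelihood(bases, errors, ref, alt, 0.0)
            - log10_likelihood(bases, errors, ref, alt, 0.5))

# Toy pileups: per-base error probability 0.001 (roughly Q30) everywhere.
tumor_bases = ["A"] * 20 + ["T"] * 10   # 10 of 30 reads support the alt allele T
normal_bases = ["A"] * 30               # the normal shows only the reference allele
errors = [0.001] * 30

tumor_score = lod_tumor(tumor_bases, errors, ref="A", alt="T")
normal_score = lod_normal(normal_bases, errors, ref="A", alt="T")
```

In this toy case both scores are strongly positive: the tumor pileup is far better explained by a mutation than by sequencing error, and the normal pileup is far better explained by the reference than by a heterozygous variant.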

Since we expect somatic mutations to occur at a rate of ~1 per Mb, we require

$$LOD_T > -log_{10} (0.5 \times 10^{-6} ) \approx 6.3$$

which guarantees that our false positive rate, due to noise in the tumor, is less than half of the somatic mutation rate.

In the normal, for sites that are not in dbSNP, we require

$$LOD_N > -log_{10} (0.5 \times 10^{-2} ) \approx 2.3$$

since non-dbSNP germline variants occur at a rate of roughly 100 per Mb, about 100 times the somatic mutation rate (hence the factor of $10^{-2}$). This cutoff guarantees that the false positive somatic call rate, due to missing the variant in the normal, is also less than half the somatic mutation rate.
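
The arithmetic behind the two cutoffs can be checked with a few lines of Python. This is only a sketch of the reasoning in this section: the helper name `lod_cutoff` and the constant names are made up for illustration, and the rates are the approximate figures quoted above (about 1 somatic mutation per Mb and about 100 non-dbSNP germline variants per Mb).

```python
import math

SOMATIC_RATE = 1e-6          # ~1 somatic mutation per Mb (per-base rate)
NOVEL_GERMLINE_RATE = 1e-4   # ~100 non-dbSNP germline variants per Mb

def lod_cutoff(prior_ratio, fp_fraction=0.5):
    """LOD needed so that false positives stay below fp_fraction of the true event rate."""
    return -math.log10(fp_fraction * prior_ratio)

# Tumor: errors must be rarer than half the somatic mutation rate.
tumor_cutoff = lod_cutoff(SOMATIC_RATE)

# Normal: missed germline variants must be rarer than half the somatic rate,
# so the relevant ratio is somatic rate / germline rate = 1e-2.
normal_cutoff = lod_cutoff(SOMATIC_RATE / NOVEL_GERMLINE_RATE)

print(f"tumor cutoff:  {tumor_cutoff:.1f}")
print(f"normal cutoff: {normal_cutoff:.1f}")
```

Both cutoffs are positive because they are the negative logarithm of a small probability ratio.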

### 3. Post-processing of candidate somatic mutations

This step aims to eliminate artifacts of next-generation sequencing, short read alignment and hybrid capture. For example, sequence context can cause hallucinated alternate alleles but often only in a single direction. Therefore, we test that the alternate alleles supporting the mutations are observed in both directions.
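
The idea of requiring support in both directions can be sketched very simply. The function below is a deliberately minimal stand-in: the name `seen_on_both_strands` and the `min_per_strand` parameter are hypothetical, and the actual MuTect filter is more involved than a plain read count (see the publication for details).

```python
def seen_on_both_strands(alt_forward, alt_reverse, min_per_strand=1):
    """Return True if alt-supporting reads appear on both the forward and reverse strand.

    alt_forward / alt_reverse are counts of reads carrying the alternate allele
    on each strand; min_per_strand is an illustrative threshold, not a MuTect default.
    """
    return alt_forward >= min_per_strand and alt_reverse >= min_per_strand

# A candidate supported on both strands passes this check...
balanced = seen_on_both_strands(alt_forward=6, alt_reverse=4)

# ...while one supported only by forward-strand reads is flagged as a likely artifact.
one_sided = seen_on_both_strands(alt_forward=10, alt_reverse=0)
```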

### Note on method validation

Most cancer genome studies at the Broad Institute have made use of MuTect and have validated the mutation calls as a part of their cancer biology papers, showing that MuTect has a very low false positive rate. A summary of validation rates from these papers is shown below:


• beijingUnconfirmed, Member

Dear Geraldine,
Thanks a lot for the documentation, which helps a lot. I have a question about running MuTect.
I noticed that in this doc the normal LOD threshold is 2.3, but the MuTect source code has
public float NORMAL_LOD_THRESHOLD = 2.2f
Could you please tell me the reason?

• Member

Hi Geraldine,
I would like to compare the variant frequency at each nucleotide site for a given exon without using a matched normal sample. The idea being that any variant frequency that is significantly different from the background sequencing error frequency would be called. Are there files from MuTect that report the background error frequency at each nucleotide for a given specimen on a given platform, or is variant calling restricted to matched pairs?