What input files does MuTect accept / require?
Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.
Analyses done with MuTect typically involve several (though not necessarily all) of the following inputs:
- Reference genome sequence
- Sequencing reads for normal tissue and tumor tissue (normal/tumor data)
- Intervals of interest
- COSMIC data
- Panel of normals
Since MuTect is based on GATK, the general format requirements are the same as those described in the GATK documentation on input files.
Below are the input requirements and/or recommendations that are specific to MuTect.
1. Normal/Tumor data
A key component of the MuTect method involves comparing evidence for variation in a tumor sample against a matched normal sample from the same individual, in order to distinguish somatic mutations from germline mutations. The Best Practice recommendation is therefore to provide MuTect with both normal and tumor data from the same individual. However, it is possible to run MuTect on tumor samples alone, without a matched normal. If available, a Panel of Normals (PoN) can be used to represent expected germline variation.
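To make the difference between a matched tumor/normal run and a tumor-only run concrete, here is a minimal Python sketch that assembles a command line for the standalone MuTect. The helper name `build_mutect_command`, the jar name `mutect.jar` and all file names are hypothetical placeholders. The flags follow the usual standalone MuTect usage (`--input_file:tumor`, `--input_file:normal`, `--normal_panel`), but check them against the documentation for your MuTect version before relying on them.

```python
from typing import List, Optional


def build_mutect_command(
    reference: str,
    tumor_bam: str,
    out: str,
    normal_bam: Optional[str] = None,
    normal_panel: Optional[str] = None,
    jar: str = "mutect.jar",  # hypothetical jar path; adjust to your install
) -> List[str]:
    """Assemble an argument list for a standalone MuTect run (illustrative only)."""
    cmd = [
        "java", "-jar", jar,
        "--analysis_type", "MuTect",
        "--reference_sequence", reference,
        "--input_file:tumor", tumor_bam,
    ]
    # The matched normal is recommended but optional.
    if normal_bam is not None:
        cmd += ["--input_file:normal", normal_bam]
    # A panel of normals can be supplied with or without a matched normal.
    if normal_panel is not None:
        cmd += ["--normal_panel", normal_panel]
    cmd += ["--out", out]
    return cmd


# Matched tumor/normal run (recommended); file names are placeholders.
paired = build_mutect_command(
    "reference.fasta", "tumor.bam", "call_stats.txt", normal_bam="normal.bam"
)

# Tumor-only run that relies on a panel of normals instead.
tumor_only = build_mutect_command(
    "reference.fasta", "tumor.bam", "call_stats.txt", normal_panel="pon.vcf"
)

print(" ".join(paired))
print(" ".join(tumor_only))
```

The returned list can be passed directly to `subprocess.run`. Building it as a list rather than a single string avoids shell quoting problems with file paths.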
2. COSMIC data
COSMIC stands for Catalogue Of Somatic Mutations In Cancer. It is a database of somatic variants that have been found to be implicated in cancer processes, maintained by the Sanger Institute (see project website).
MuTect uses the COSMIC data to whitelist variants found in tumor samples, so that they are not filtered out even if they are also present in dbSNP or in a panel of normals.
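The whitelisting idea can be illustrated with a deliberately simplified Python sketch. This is not MuTect's actual implementation, whose filters are more nuanced. It only mirrors the rule as described above: a candidate found in COSMIC is kept even when it also appears in dbSNP or in the panel of normals. The `Site` tuple, the function name `keep_candidate` and the example sites are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical site key: (chromosome, position, reference allele, alternate allele)
Site = Tuple[str, int, str, str]


def keep_candidate(
    site: Site,
    cosmic: Set[Site],
    dbsnp: Set[Site],
    panel_of_normals: Set[Site],
) -> bool:
    """Return True if the candidate survives this simplified filter."""
    # Whitelist: a COSMIC site is never rejected by the two germline resources.
    if site in cosmic:
        return True
    # Otherwise, presence in dbSNP or the panel of normals rejects it.
    return site not in dbsnp and site not in panel_of_normals


# Made-up example sites, for illustration only.
cosmic = {("7", 1000, "A", "T")}
dbsnp = {("7", 1000, "A", "T"), ("1", 500, "G", "A")}
pon = {("2", 42, "C", "T")}

in_cosmic_and_dbsnp = keep_candidate(("7", 1000, "A", "T"), cosmic, dbsnp, pon)
in_dbsnp_only = keep_candidate(("1", 500, "G", "A"), cosmic, dbsnp, pon)
in_pon_only = keep_candidate(("2", 42, "C", "T"), cosmic, dbsnp, pon)
novel = keep_candidate(("3", 99, "T", "G"), cosmic, dbsnp, pon)
```

In this sketch the COSMIC check comes first, so it takes precedence over both exclusion lists. A site that appears in none of the three resources is kept by default.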