What is the output of MuTect and how should I interpret it?
Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.
MuTect produces a lot of information that is spread across several different files. This document describes the most important outputs and how to interpret them. For a complete list of outputs and their descriptions, use the -help flag at the command line.
* Call-stats file
The main output which people typically work with is the "call-stats" file. It is an exhaustive report of all the metrics and statistics available about the calls made by MuTect and the filters that are applied internally by default. See further below for a more complete description of the call-stats output.
* VCF file of candidate mutations
Upon request, MuTect can output a summary VCF file containing the mutation candidates annotated with KEEP or REJECT in the FILTER field.
* Coverage / WIGGLE files
Also upon request, MuTect can output so-called "wiggle" files (in WIGGLE format) that contain useful information about the read coverage observed in the data. This format indicates for every base whether it is sufficiently covered in the tumor and normal to be sensitive enough to call mutations. We currently use cutoffs of at least 14 reads in the tumor and at least 8 in the normal (these cutoffs are applied after removing noisy reads in the preprocessing step). There are several different files that can be generated, containing e.g. overall coverage, just the tumor, just the normal, and so on.
More details about the call-stats file and how to use it
The call-stats output contains a lot of information that is intended to help with development, but that most users don't need to take into account in their analysis. Since this can be rather confusing, we recommend that you extract subsets of information from the call-stats file according to your needs, rather than trying to work with the whole thing.
Extracting subsets of data using grep
The most common subset you'll want to work with is the set of confident calls that were not rejected by MuTect's internal filters. An easy way to do this using basic Unix tools is to search for lines that don't contain the string REJECT:
grep -v REJECT <my.call_stats.txt>
You can also select subsets of sites that were filtered for specific reasons, in case you want to "rescue" those sites. This is equivalent to disabling the corresponding internal filters in MuTect, which is currently hard to do from the command line.
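As an illustration of this kind of subsetting, here is a minimal sketch that splits a call-stats file into retained and rejected sites and counts each. It makes some assumptions that are not stated above: that the file is tab-delimited, and that it begins with comment lines starting with '#' followed by a single header line. The toy file, its values, and the output file names are made up for the example; substitute your own call-stats file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-in for a call-stats file (made-up values, only three columns shown).
# Assumption: tab-delimited, '#' comment lines first, then one header line.
workdir=$(mktemp -d)
stats="$workdir/toy.call_stats.txt"
{
  printf '## toy comment line\n'
  printf 'contig\tposition\tjudgement\n'
  printf 'chr1\t100\tKEEP\n'
  printf 'chr1\t250\tREJECT\n'
  printf 'chr2\t75\tKEEP\n'
} > "$stats"

# Confident calls: every line that does not contain REJECT
# (the comment and header lines are kept as well).
grep -v REJECT "$stats" > "$workdir/keep.txt"

# Rejected sites only: lines that contain REJECT as a whole word.
grep -w REJECT "$stats" > "$workdir/reject.txt"

# Count the sites in each subset.
keep_count=$(grep -c -w KEEP "$workdir/keep.txt")
reject_count=$(grep -c -w REJECT "$workdir/reject.txt")
echo "KEEP: $keep_count  REJECT: $reject_count"
```

The rejected subset is the starting point for any "rescue" attempt: from there you can narrow the selection further by matching on whatever additional string identifies the sites you care about.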
Understanding the main statistics / fields
Here are the definitions of some of the most prominent outputs in the call-stats file:
- contig: the contig location of this candidate
- position: the 1-based position of this candidate on the given contig
- ref_allele: the reference allele for this candidate
- alt_allele: the mutant (alternate) allele for this candidate
- tumor_name: name of the tumor as given on the command line, or extracted from the BAM
- normal_name: name of the normal as given on the command line, or extracted from the BAM
- score: for future development
- dbsnp_site: is this a dbsnp site as defined by the dbsnp bitmask supplied to the caller
- covered: was the site powered to detect a mutation (80% power for a 0.3 allelic fraction mutation)
- power: tumor_power * normal_power
- tumor_power: given the tumor sequencing depth, what is the power to detect a mutation at 0.3 allelic fraction
- normal_power: given the normal sequencing depth, what power did we have to detect (and reject) this as a germline variant
- total_pairs: total tumor and normal read depth that comes from paired reads
- improper_pairs: number of reads which have abnormal pairing (orientation and distance)
- map_Q0_reads: total number of mapping quality zero reads in the tumor and normal at this locus
- init_t_lod: deprecated
- t_lod_fstar: CORE STATISTIC: log of ( likelihood tumor event is real / likelihood event is sequencing error )
- tumor_f: allelic fraction of this candidate based on read counts
- contaminant_fraction: estimate of contamination fraction used (supplied or defaulted)
- contaminant_lod: log likelihood of ( event is contamination / event is sequencing error )
- t_ref_count: count of reference alleles in tumor
- t_alt_count: count of alternate alleles in tumor
- t_ref_sum: sum of quality scores of reference alleles in tumor
- t_alt_sum: sum of quality scores of alternate alleles in tumor
- t_ins_count: count of insertion events at this locus in tumor
- t_del_count: count of deletion events at this locus in tumor
- normal_best_gt: most likely genotype in the normal
- init_n_lod: log likelihood of ( normal being reference / normal being altered )
- n_ref_count: count of reference alleles in normal
- n_alt_count: count of alternate alleles in normal
- n_ref_sum: sum of quality scores of reference alleles in normal
- n_alt_sum: sum of quality scores of alternate alleles in normal
- judgement: final judgement of site KEEP or REJECT (not enough evidence or artifact)
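Since most analyses only need a handful of these fields, it can help to select columns by name rather than by position. The following is a minimal sketch with awk, under the same assumptions as before (tab-delimited file, '#' comment lines, one header line whose names match the field names listed above). The toy file and its values are made up, and the list in the `want` variable is just one possible selection.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-in for a call-stats file (made-up values, subset of columns).
workdir=$(mktemp -d)
stats="$workdir/toy.call_stats.txt"
{
  printf '## toy comment line\n'
  printf 'contig\tposition\tref_allele\talt_allele\tscore\ttumor_f\tjudgement\n'
  printf 'chr1\t100\tA\tT\t0\t0.25\tKEEP\n'
  printf 'chr1\t250\tG\tC\t0\t0.02\tREJECT\n'
  printf 'chr2\t75\tC\tG\t0\t0.5\tKEEP\n'
} > "$stats"

# Print the requested columns, looked up by header name, for KEEP sites only.
summary=$(awk -F'\t' -v OFS='\t' \
  -v want='contig,position,ref_allele,alt_allele,tumor_f,judgement' '
  /^#/ { next }                              # skip comment lines
  !header_done {                             # first remaining line is the header
    for (i = 1; i <= NF; i++) col[$i] = i    # map column name -> column index
    n = split(want, names, ",")
    for (k = 1; k <= n; k++)
      if (!(names[k] in col)) {
        print "missing column: " names[k] > "/dev/stderr"
        exit 1
      }
    header_done = 1
    is_header = 1
  }
  is_header || $(col["judgement"]) == "KEEP" {
    line = $(col[names[1]])
    for (k = 2; k <= n; k++) line = line OFS $(col[names[k]])
    print line
    is_header = 0
  }
' "$stats")

printf '%s\n' "$summary"
```

Looking columns up by name keeps the command working even if the column order differs between MuTect versions, and it fails loudly if a requested field is absent.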