How MuTect filters candidate mutations

Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.
Overview
This document describes the methodological underpinnings of the filters that MuTect applies by default to distinguish real mutations from sequencing artifacts and errors. Some of these filters are applied in all detection modes, while others are only applied in "High Confidence" detection mode.
Note that at the moment, there is no straightforward way to disable these filters. It is possible to disable each by passing parameter values that render the filters ineffective (e.g. set a value of zero for a filter that requires a minimum value of some quantity) but this has to be examined on a case-by-case basis. A more practical solution is to leave the filter parameters untouched, but instead perform some filtering on the CALLSTATS file using text processing functions (e.g. test for lines that have REJECT in only one of several columns).
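As a rough illustration of the text-processing approach, the sketch below reads a CALLSTATS file and keeps the candidates that either passed or were rejected only for reasons you have decided to tolerate. It assumes a tab-delimited file with '#' comment lines before the header, a judgement column (KEEP/REJECT) and a failure_reasons column holding comma-separated reasons; check those names against your own output. The function name and the sample rows are made up for the example.

```python
import csv
import io

def rescue_candidates(handle, tolerated):
    """Yield rows that are KEEP, or REJECT only for reasons in `tolerated`.

    Assumption: tab-delimited CALLSTATS layout with '#' comment lines,
    a 'judgement' column and a comma-separated 'failure_reasons' column.
    """
    tolerated = set(tolerated)
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(data_lines, delimiter="\t"):
        reasons = {r for r in row["failure_reasons"].split(",") if r}
        only_tolerated = bool(reasons) and reasons <= tolerated
        if row["judgement"] == "KEEP" or only_tolerated:
            yield row

# Made-up rows for illustration, not real MuTect output.
example = io.StringIO(
    "## made-up header comment\n"
    "contig\tposition\tjudgement\tfailure_reasons\n"
    "1\t100\tKEEP\t\n"
    "1\t200\tREJECT\talt_allele_in_normal\n"
    "1\t300\tREJECT\talt_allele_in_normal,poor_mapping_region_alternate_allele_mapq\n"
)
rescued = [row["position"] for row in rescue_candidates(example, {"alt_allele_in_normal"})]
print(rescued)
```

With a real file you would pass an open file handle instead of the in-memory example, and choose the tolerated reasons to match the filter you want to neutralize.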
Filters used in high-confidence mode
1. Proximal Gap
This filter removes false positives (FP) caused by nearby misaligned small indel events. MuTect will reject a candidate site if there are more than a given number of reads with insertions/deletions in an 11 base pair window centered on the candidate. The threshold value is controlled by the --gap_events_threshold argument.
In the CALLSTATS output file, the relevant columns are labeled t_ins_count and t_del_count.
2. Poor Mapping
This filter removes FPs caused by reads that are poorly mapped (typically due to sequence similarities between different portions of the genome). The filter uses two tests:
- Reject the candidate if the fraction of reads with a mapping quality of 0 across the tumor and normal samples is too high relative to a given threshold. The threshold value is controlled by --fraction_mapq_threshold.
- Reject the candidate if it does not have at least one observation of the mutant allele with a mapping quality that satisfies a given threshold. The threshold value is controlled by --required_maximum_alt_allele_mapping_quality_score.
In the CALLSTATS output file, the relevant columns are labeled total_reads and map_Q0_reads for the first test, and t_alt_max_mapq for the second test.
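The two tests can be sketched from the CALLSTATS columns as follows. This is only an illustration: the inclusive comparison in the first test, the reason name poor_mapping_region_mapq0 and the threshold values used in the example (0.5 and 20) are assumptions, not values confirmed by this document. The second test also shows why the argument is named "required maximum": the highest mapping quality seen among the mutant-allele reads must reach the required value.

```python
def poor_mapping_reasons(total_reads, map_q0_reads, t_alt_max_mapq,
                         fraction_mapq_threshold, required_max_alt_mapq):
    """Return the poor-mapping failure reasons for one candidate (may be empty)."""
    reasons = []
    # Test 1: too large a share of all tumor + normal reads have MAPQ 0.
    # (Inclusive comparison is an assumption.)
    if total_reads > 0 and map_q0_reads / total_reads >= fraction_mapq_threshold:
        reasons.append("poor_mapping_region_mapq0")  # assumed reason name
    # Test 2: the best-mapped read carrying the mutant allele is still
    # below the required mapping quality.
    if t_alt_max_mapq < required_max_alt_mapq:
        reasons.append("poor_mapping_region_alternate_allele_mapq")
    return reasons

# Illustrative thresholds only.
many_mapq0 = poor_mapping_reasons(100, 60, 30, 0.5, 20)
weak_alt = poor_mapping_reasons(100, 10, 10, 0.5, 20)
print(many_mapq0, weak_alt)
```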
3. Strand Bias
This filter rejects FPs caused by context-specific sequencing errors where the vast majority of alternate alleles are seen in a single direction of reads. Candidates are rejected if the strand-specific LOD is below a given threshold in a direction where the sensitivity to have passed that threshold is above a certain percentage. The LOD threshold value is controlled by --strand_artifact_lod and the percentage is controlled by --strand_artifact_power_threshold.
In the CALLSTATS output file, the relevant columns are labeled power_to_detect_negative_strand_artifact and t_lod_fstar_forward. There are also complementary columns labeled power_to_detect_positive_strand_artifact and t_lod_fstar_reverse.
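One way to read the rule above is as a check applied independently to each pair of columns: a direction only counts against the candidate when there was enough power to see the mutation there and the LOD still came out low. The sketch below follows the column pairing given in the text; the function name, the handling of values exactly at the thresholds, and the example thresholds (2.0 and 0.9) are assumptions made for illustration.

```python
def fails_strand_bias(row, strand_artifact_lod, strand_artifact_power_threshold):
    """Check both (power, LOD) column pairs of a CALLSTATS row (dict of strings)."""
    pairs = [
        ("power_to_detect_negative_strand_artifact", "t_lod_fstar_forward"),
        ("power_to_detect_positive_strand_artifact", "t_lod_fstar_reverse"),
    ]
    return any(
        # Enough power to have passed the LOD threshold, yet the LOD is low.
        float(row[power_col]) >= strand_artifact_power_threshold
        and float(row[lod_col]) < strand_artifact_lod
        for power_col, lod_col in pairs
    )

# Made-up rows for illustration.
biased = {
    "power_to_detect_negative_strand_artifact": "0.95",
    "t_lod_fstar_forward": "0.5",
    "power_to_detect_positive_strand_artifact": "0.97",
    "t_lod_fstar_reverse": "9.1",
}
low_power = dict(biased, power_to_detect_negative_strand_artifact="0.40")

print(fails_strand_bias(biased, 2.0, 0.9), fails_strand_bias(low_power, 2.0, 0.9))
```

In the second row the low LOD is not held against the candidate, because the power in that pair is too low for the LOD to be informative.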
4. Clustered Position
This filter rejects FPs caused by misalignments evidenced by the alternate alleles being clustered at a consistent distance from the start or end of the read alignment. Candidates are rejected if the median distance of the alternate alleles from the start/end of the read and the median absolute deviation of that distance are both lower than or equal to given thresholds. The position-from-end-of-read threshold value is controlled by --pir_median_threshold and the deviation threshold is controlled by --pir_mad_threshold.
In the CALLSTATS output file, the relevant columns are labeled tumor_alt_fpir_median and tumor_alt_fpir_mad for the forward strand, and the complementary columns are labeled tumor_alt_rpir_median and tumor_alt_rpir_mad for the reverse (note the name difference is fpir vs. rpir, for forward vs. reverse position in read).
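To make the two statistics concrete, here is a small sketch that computes the median and the median absolute deviation (MAD) of the positions of the alternate allele within the supporting reads, then applies both thresholds. The function names and the example threshold values (10 and 3) are illustrative assumptions; the offsets are made up.

```python
from statistics import median

def median_and_mad(offsets):
    """Median of the offsets and median absolute deviation around it."""
    med = median(offsets)
    mad = median([abs(x - med) for x in offsets])
    return med, mad

def is_clustered(offsets, pir_median_threshold, pir_mad_threshold):
    """True when both statistics are at or below their thresholds."""
    med, mad = median_and_mad(offsets)
    return med <= pir_median_threshold and mad <= pir_mad_threshold

# Alternate alleles piled up near one end of the reads: small median, tiny spread.
near_end = [2, 3, 3, 4, 3]
# Alternate alleles spread along the reads.
spread = [5, 20, 35, 50, 65]

print(median_and_mad(near_end), is_clustered(near_end, 10, 3))
print(median_and_mad(spread), is_clustered(spread, 10, 3))
```

A low median alone is not enough to reject; the positions must also be tightly grouped, which is what the MAD threshold captures.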
5. Observed in Control
This filter rejects FPs in tumor data by looking at control data (typically from a matched normal) for evidence of the alternate allele that is above random sequencing error. Candidates are rejected if both the following conditions are met:
- The number of observations of the alternate allele or the proportion of reads carrying the alternate allele is above a given threshold, controlled by --max_alt_alleles_in_normal_count and --max_alt_allele_in_normal_fraction, respectively.
- The sum of quality scores is above a given threshold value, controlled by --max_alt_alleles_in_normal_qscore_sum.
In the CALLSTATS output file, the relevant columns are labeled n_alt_count, normal_f, and n_alt_sum.
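The way the two conditions combine (count OR fraction, AND quality sum) can be sketched as below. The function name, the strict "above" comparisons and the example threshold values are assumptions for illustration rather than confirmed MuTect behavior.

```python
def fails_observed_in_control(n_alt_count, normal_f, n_alt_sum,
                              max_count, max_fraction, max_qscore_sum):
    """True when the normal shows credible evidence of the alternate allele."""
    # Condition 1: enough alternate-allele reads, by count or by fraction.
    enough_reads = n_alt_count > max_count or normal_f > max_fraction
    # Condition 2: those observations are not just low-quality bases.
    enough_quality = n_alt_sum > max_qscore_sum
    return enough_reads and enough_quality

# Illustrative thresholds: count 1, fraction 0.03, quality sum 20.
print(fails_observed_in_control(3, 0.05, 90, 1, 0.03, 20))  # both conditions met
print(fails_observed_in_control(2, 0.02, 10, 1, 0.03, 20))  # reads present but low quality
print(fails_observed_in_control(1, 0.01, 30, 1, 0.03, 20))  # too few reads
```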
Filters applied in all MuTect modes
1. Tumor and normal LOD scores
This filter rejects candidates with a tumor LOD score below a given threshold value, controlled by --tumor_lod, and similarly for a normal LOD score threshold controlled by --normal_lod_threshold.
In the CALLSTATS output file, the relevant columns are labeled t_lod_fstar and init_n_lod, respectively.
2. Possible contamination
This filter rejects candidates with potential cross-patient contamination, controlled by --fraction_contamination.
In the CALLSTATS output file, the relevant columns are labeled t_lod_fstar and contaminant_lod.
3. Normal LOD score and dbsnp status
If a candidate mutation is in dbsnp but is not in COSMIC, it may be a germline variant. In that case, the normal LOD threshold that the candidate must clear is raised to a value controlled by --dbsnp_normal_lod.
In the CALLSTATS output file, the relevant column is labeled init_n_lod.
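The interaction between the regular normal LOD threshold and the stricter dbsnp one can be summarized in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the example values (2.2 and 5.5) are illustrative rather than defaults confirmed by this document.

```python
def required_normal_lod(in_dbsnp, in_cosmic, normal_lod_threshold, dbsnp_normal_lod):
    """Pick the normal LOD threshold a candidate has to clear."""
    # Known dbsnp site that is not also in COSMIC: possible germline variant,
    # so demand stronger evidence that the normal is reference.
    if in_dbsnp and not in_cosmic:
        return dbsnp_normal_lod
    return normal_lod_threshold

def fails_normal_lod(init_n_lod, in_dbsnp, in_cosmic,
                     normal_lod_threshold, dbsnp_normal_lod):
    """True when init_n_lod is below the threshold that applies to this site."""
    needed = required_normal_lod(in_dbsnp, in_cosmic,
                                 normal_lod_threshold, dbsnp_normal_lod)
    return init_n_lod < needed

# Same normal LOD, different outcome depending on database membership.
print(fails_normal_lod(3.0, False, False, 2.2, 5.5))  # novel site
print(fails_normal_lod(3.0, True, False, 2.2, 5.5))   # dbsnp only
print(fails_normal_lod(3.0, True, True, 2.2, 5.5))    # dbsnp and COSMIC
```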
4. Triallelic Site Filter
When the program is evaluating a site, it considers all possible alternate alleles as mutation candidates, and puts them through all the filters detailed above. If more than one candidate allele passes all filters, resulting in a proposed triallelic site, the site is rejected with the reason triallelic_site because it is extremely unlikely that this would really happen in a tumor sample.
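A minimal sketch of this last step, assuming the per-allele results of the earlier filters are available as a mapping from alternate allele to its list of failure reasons (a hypothetical data structure chosen for the example). Here the reason is added only to the alleles that had passed; how MuTect records it for alleles that had already failed is not stated above.

```python
def apply_triallelic_filter(reasons_by_allele):
    """Add 'triallelic_site' to the passing alleles if more than one passes.

    `reasons_by_allele` maps each alternate allele to the failure reasons
    collected so far (an empty list means the allele passed every filter).
    """
    passing = {alt for alt, reasons in reasons_by_allele.items() if not reasons}
    triallelic = len(passing) > 1
    return {
        alt: list(reasons) + (["triallelic_site"] if triallelic and alt in passing else [])
        for alt, reasons in reasons_by_allele.items()
    }

# Made-up site where two alternate alleles survive the other filters.
site = {"A": [], "T": [], "G": ["fstar_tumor_lod"]}
result = apply_triallelic_filter(site)
print(result)
```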
Comments
Hi,
I have a few questions about MuTect output and filtering:
1) I was wondering if you can elaborate a bit more on how exactly the strand bias filter is implemented?
You mentioned several values that are relevant to strand bias, namely:
But I don't see these columns in the output file. Are there some flags I need to specify to get them in the output file?
2) Is HC detection mode enabled by default, or do I need to specifically enable it? Are the variants marked as KEEP or REJECT in the output file passed through just the standard detection or through the HC detection?
3) Here you have mentioned several options to control filter parameters, e.g.:
--gap_events_threshold
--dbsnp_normal_lod
--strand_artifact_lod
Are they called at run time?
They also don't seem to be documented elsewhere. Is there a full documentation on all the available parameters/arguments?
Many thanks,
Paul
Hi Paul,
1) Try adding --enable_extended_output to your command.
2) MuTect runs in high-confidence mode by default, assuming you are providing both a tumor and a normal.
3) You can set their values in your command line, if that's what you mean. We don't currently have a proper document listing the arguments, but they are fairly easy to read on this page of the code repository: MuTectArgumentCollection.java
Thanks!
Could you tell me how to filter out oxidation events as a result of shearing using MuTect as described in Costello et al 2012 (dx.doi.org/10.1093/nar/gks1443)? Thanks!
I believe this is done using another program outside of MuTect; the software is available at http://broadinstitute.org/cancer/cga/dtoxog
Great, thank you very much!
Hi Geraldine,
I have a few questions about MuTect output and filtering:
Could you list the options that control the filter parameters? I would like to understand why the candidates were rejected.
The failure_reasons are as follows:
1) fstar_tumor_lod
2) possible_contamination
3) normal_lod
4) alt_allele_in_normal
5) poor_mapping_region_alternate_allele_mapq
I couldn't find them in the document or the link (MuTectArgumentCollection.java) you gave.
I tried to add --enable_extended_output to my command, but it returned a lot of error information:
java -Xmx4g -jar /public/apps/mutect/1.1.7/java.1.7.0_67/mutect-1.1.7.jar --analysis_type MuTect --enable_extended_output
INFO 17:01:19,267 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:01:19,328 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-0-g72492bb, Compiled 2015/01/21 17:10:56
INFO 17:01:19,328 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:01:19,329 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:01:19,335 HelpFormatter - Program Args: --analysis_type MuTect --enable_extended_output
INFO 17:01:19,358 HelpFormatter - Executing as [email protected] on Linux 2.6.32-504.1.3.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_67-b01.
INFO 17:01:19,358 HelpFormatter - Date/Time: 2015/12/04 17:01:19
INFO 17:01:19,359 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:01:19,359 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:01:20,750 GenomeAnalysisEngine - Strictness is SILENT
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.1-0-g72492bb):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Walker requires a reference but none was provided.
ERROR ------------------------------------------------------------------------------------------
thanks a lot,
Cherry
Sorry for the late reply. Your problem is that your command line is incomplete. There are several arguments missing.
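For example, the error message above complains about the missing reference, and the tumor and normal BAM files and an output file are not given either. The sketch below assembles a fuller command; the flag names follow the usual MuTect 1 usage as far as I know, so please check them against --help for your version. All file paths are placeholders and the helper function is hypothetical.

```python
import shlex

def mutect_command(jar, reference, normal_bam, tumor_bam, out, extended=True):
    """Build an argument list for a basic tumor/normal MuTect run."""
    cmd = [
        "java", "-Xmx4g", "-jar", jar,
        "--analysis_type", "MuTect",
        "--reference_sequence", reference,   # the argument the error asks for
        "--input_file:normal", normal_bam,
        "--input_file:tumor", tumor_bam,
        "--out", out,                        # the CALLSTATS file
    ]
    if extended:
        cmd.append("--enable_extended_output")
    return cmd

# Placeholder paths, to be replaced with your own files.
cmd = mutect_command("mutect-1.1.7.jar", "reference.fasta",
                     "normal.bam", "tumor.bam", "call_stats.txt")
print(" ".join(shlex.quote(arg) for arg in cmd))
```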
Hello Geraldine,
Sorry for the naive question but I am a little bit confused with the "--required_maximum_alt_allele_mapping_quality_score" parameter. Shouldn't it be the threshold for the minimum required mapping quality ensuring that there is "at least one observation of the mutant allele" with a decent mapping quality?.. Otherwise I can't make sense out of it.
Thank you!
Eugenie
And one more question concerning filter parameters. In Mutect-1.1.4 there was a "--clipping_bias_pvalue_threshold" argument (defined as "pvalue threshold for fishers exact test of clipping bias in mutant reads vs ref reads"), with a default value of 0.05. However, when using Mutect-1.1.7 I get "ERROR MESSAGE: Argument with name 'clipping_bias_pvalue_threshold' isn't defined".
Was this argument discarded or renamed (I don't see anything similar in the Mutect-1.1.7 vcf header)?
The reason I am interested is that I would like to disable it.
So if the filter is discarded the problem is solved.
I just want to make sure that the only argument related to soft/hard clipping left is "--heavily_clipped_read_fraction".
(I have an example when all variant reads are soft clipped but I should probably post this issue separately)
Thank you!
Eugenie
Thanks.
Riccardo
@RikyJKD
Hi, wouldn't the raw output contain them anyway? If you don't use "only_passing_calls" argument they will be in your output file, marked as "REJECT" with a failure reason like germline_risk/normal_lod/alt_allele_in_normal etc.
But I would also like to know what exactly are all these candidates.
I would assume that initially every position with at least 1 variant read (quality filters applied?) is considered and then germline and artifacts are filtered. (Otherwise I can't explain where such a huge number of candidates comes from).
@Geraldine, please can you comment on it?
Thank you!
Eugenie
Hi, thank you for the reply. I think that the best solution is to consider as KEEP all the REJECT calls that did not pass the alt_allele_in_normal filter, do you agree?
Thank you.
Riccardo.
Sorry, let me try to explain myself better. I would like to also consider as KEEP the mutations detected in the normal, because my normal is also a tumor. Do I have to consider all the REJECT calls that do not pass alt_allele_in_normal, or also germline_risk and normal_lod?
Thank you.
Riccardo
Hi,
I am not sure I understand it correctly: are you calling tumors without normal samples or?..
I can't imagine why would you call two tumors together.
And are you looking to retrieve real germline variants or somatic variants present in another tumor sample?
I didn't test it myself but I would assume that by rescuing candidates marked with "alt_allele_in_normal" you should get most germline variants. However as you can imagine this will also include artifacts (eg coming from misalignments, where only a few variant reads are present in all samples). So you would probably like to filter them further.
Btw, the alt_allele_in_normal filter is something you can change with "--max_alt_alleles_in_normal_count" argument (1 by default). Will it solve the problem maybe?
In your opinion, is my approach correct?
Can I also consider germline_risk and normal_lod?
Thank you.
Riccardo
Ah, I see, that's an interesting combination.
But do you also have a normal sample for these tumors?
I didn't have such projects myself but I would suggest you call tumors separately.
Then you will have 2 sets of somatic mutations and you can do all kinds of filtering and intersections.
I think filtering 2 sets of mutations (eg with GATK SelectVariants tool) is easier than trying to separate rejected candidates which represent a mix of true germline and somatic.
That's what I was doing when dealing with more than one tumor per patient.
Let me know if understood you correctly and made myself clear.
Best,
Eugenie
Thank you.
Riccardo
I see.
By "germline" you mean present in another cancer sample?
I would think alt_allele_in_normal is enough.
But then I am not sure how are you going to separate SNVs from real germline: by allele frequency? (if the tumor content isn't high) filtering against SNPs from public datasets? (wouldn't recommend dbSNP as it has quite some somatic variants nowadays)
Best,
Eugenie
Yes, by germline in this case I mean the mutations present in the cancer sample used as normal. I would like to consider the allele frequency and dbSNP, and thank you for the information about the somatic variants it contains. Which databases do you advise instead of dbSNP in this case?
Thanks.
Riccardo
Well...1000 genomes is an obvious option but whether it will be helpful depends on the population you are working with.
Do you have access to any normal samples?
The best option would be to create a "panel of normals" using normal samples from the same population which were ideally processed in the same way (have a look at some related threads, eg http://gatkforums.broadinstitute.org/gatk/discussion/6904/panel-of-normals-for-mutect#latest).
So even if you don't have matched normal samples maybe you have access to some normal samples sequenced in the same center - that would be of great help.
Thank you but I do not have any normal samples.
Thanks.
Riccardo
Hi, dear Geraldine:
What is the recommended value of the parameters --strand_artifact_lod and --strand_artifact_power_threshold if I want to filter strand bias FPs? Should I use these two parameters at the same time?
We have used GATK4 Mutect 2 to call somatic mutations from a customised NGS panel. However, we have problems regarding the mutant-allele percentage given by Mutect2. As a matter of fact, a patient harboring the V600E mutation in BRAF determined both by Sanger and a commercial NGS panel (Tumor 15, from Illumina), was found to have a % of mutant-allele of 0.4% after performing Mutect2. Moreover, this percentage does not fit with what we observed in the merged tumor bam files (around 24% of mutant-allele clone).
Could you please help us to tackle this problem?
Thank you so much.