
How MuTect filters candidate mutations

Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
edited December 2015 in MuTect v1 Documentation

Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.

Overview

This document describes the methodological underpinnings of the filters that MuTect applies by default to distinguish real mutations from sequencing artifacts and errors. Some of these filters are applied in all detection modes, while others are only applied in "High Confidence" detection mode.

Note that at the moment, there is no straightforward way to disable these filters. It is possible to disable each by passing parameter values that render the filters ineffective (e.g. set a value of zero for a filter that requires a minimum value of some quantity) but this has to be examined on a case-by-case basis. A more practical solution is to leave the filter parameters untouched, but instead perform some filtering on the CALLSTATS file using text processing functions (e.g. test for lines that have REJECT in only one of several columns).
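
As a rough illustration of the text-processing approach described above, here is a minimal Python sketch that keeps calls that either passed or were rejected only for reasons you are willing to tolerate. It assumes the CALLSTATS file is tab-delimited, that comment lines start with "#", and that it has columns named judgement and failure_reasons (with failure_reasons holding a comma-separated list); check these against your own file. The function name rescue_calls and the example rows are made up for illustration.

```python
import csv

def rescue_calls(callstats_lines, tolerated_reasons):
    """Return rows that are KEEP, or REJECT only for tolerated reasons."""
    tolerated = set(tolerated_reasons)
    # Assumption: comment lines start with '#' and precede the header row.
    data_lines = [ln for ln in callstats_lines
                  if ln.strip() and not ln.startswith("#")]
    rescued = []
    for row in csv.DictReader(data_lines, delimiter="\t"):
        # Assumption: failure_reasons is a comma-separated list of labels.
        reasons = {r for r in row["failure_reasons"].split(",") if r}
        if row["judgement"] == "KEEP" or (reasons and reasons <= tolerated):
            rescued.append(row)
    return rescued

# Made-up example rows, not real MuTect output.
EXAMPLE = [
    "## example comment line",
    "contig\tposition\tfailure_reasons\tjudgement",
    "1\t100\t\tKEEP",
    "1\t200\talt_allele_in_normal\tREJECT",
    "1\t300\talt_allele_in_normal,normal_lod\tREJECT",
]

rescued = rescue_calls(EXAMPLE, {"alt_allele_in_normal"})
print([row["position"] for row in rescued])
```

To run it on a real file, pass an open file handle instead of the list, since the function only needs an iterable of lines.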


Filters used in high-confidence mode

1. Proximal Gap

This filter removes false positives (FPs) caused by nearby misaligned small indel events. MuTect will reject a candidate site if there are more than a given number of reads with insertions or deletions in an 11-base-pair window centered on the candidate. The threshold value is controlled by --gap_events_threshold.

In the CALLSTATS output file, the relevant columns are labeled t_ins_count and t_del_count.
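
To re-apply this test to CALLSTATS values with a different threshold, a sketch along these lines may help. It assumes that the insertion and deletion counts are each compared to the threshold separately and that "more than" means a strict comparison; neither detail is confirmed by this article, and the function name is hypothetical.

```python
def fails_proximal_gap(t_ins_count, t_del_count, gap_events_threshold):
    """Return True if either indel count exceeds the threshold.

    Assumption: insertions and deletions are tested separately with a
    strict '>' comparison; MuTect's exact rule may differ.
    """
    return (t_ins_count > gap_events_threshold
            or t_del_count > gap_events_threshold)

# Illustrative values only.
print(fails_proximal_gap(t_ins_count=0, t_del_count=4, gap_events_threshold=3))
print(fails_proximal_gap(t_ins_count=1, t_del_count=2, gap_events_threshold=3))
```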

2. Poor Mapping

This filter removes FPs caused by reads that are poorly mapped (typically due to sequence similarities between different portions of the genome). The filter uses two tests:

  • Reject the candidate if the fraction of reads with a mapping quality of 0 in the tumor and normal samples is too high relative to a given threshold. The threshold value is controlled by --fraction_mapq_threshold.

  • Reject the candidate if it does not have at least one observation of the mutant allele with a mapping quality that satisfies a given threshold. The threshold value is controlled by --required_maximum_alt_allele_mapping_quality_score.

In the CALLSTATS output file, the relevant columns are labeled total_reads and map_Q0_reads for the first test, and t_alt_max_mapq for the second test.
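
The two tests can be sketched in Python as follows. This is an approximation based on the description above, not MuTect's code: the comparison operators (>= for the MAPQ0 fraction, < for the maximum alternate-allele mapping quality) are assumptions, as are the function name and the keys of the returned dictionary.

```python
def poor_mapping_checks(total_reads, map_q0_reads, t_alt_max_mapq,
                        fraction_mapq_threshold, required_max_alt_mapq):
    """Return which of the two poor-mapping tests a candidate fails."""
    # Test 1: fraction of reads with mapping quality 0.
    mapq0_fraction = map_q0_reads / total_reads if total_reads else 0.0
    # Test 2: the best-mapped alternate-allele read must reach the threshold.
    return {
        "mapq0_fraction_failed": mapq0_fraction >= fraction_mapq_threshold,
        "alt_max_mapq_failed": t_alt_max_mapq < required_max_alt_mapq,
    }

# Illustrative values only.
result = poor_mapping_checks(total_reads=100, map_q0_reads=60,
                             t_alt_max_mapq=40,
                             fraction_mapq_threshold=0.5,
                             required_max_alt_mapq=20)
print(result)
```

This reading also shows why the second argument has "maximum" in its name: the threshold is applied to the maximum mapping quality seen among the alternate-allele reads.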

3. Strand Bias

This filter rejects FPs caused by context-specific sequencing errors where the vast majority of alternate alleles are seen in a single direction of reads. Candidates are rejected if the strand-specific LOD is below a given threshold in a direction where the sensitivity to have passed that threshold is above a certain percentage. The LOD threshold value is controlled by --strand_artifact_lod and the percentage is controlled by --strand_artifact_power_threshold.

In the CALLSTATS output file, the relevant columns are labeled power_to_detect_negative_strand_artifact and t_lod_fstar_forward. There are also complementary columns labeled power_to_detect_positive_strand_artifact and t_lod_fstar_reverse.
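
Read literally, the rule combines the two parameters: a strand only counts against the candidate when there was enough power to see the allele on that strand and the LOD still came out low. A minimal sketch, assuming the column pairing listed above (negative-strand power with t_lod_fstar_forward, positive-strand power with t_lod_fstar_reverse) and assuming >= for power and < for LOD; the function name and example values are hypothetical.

```python
def fails_strand_bias(row, strand_artifact_lod, strand_artifact_power_threshold):
    """Return True if either strand has high power but a low strand LOD."""
    # Assumption: columns are paired in the order given in the article.
    checks = [
        ("power_to_detect_negative_strand_artifact", "t_lod_fstar_forward"),
        ("power_to_detect_positive_strand_artifact", "t_lod_fstar_reverse"),
    ]
    for power_col, lod_col in checks:
        power = float(row[power_col])
        lod = float(row[lod_col])
        if power >= strand_artifact_power_threshold and lod < strand_artifact_lod:
            return True
    return False

# Made-up extended-output values, parsed as strings like a CALLSTATS row.
biased = {
    "power_to_detect_negative_strand_artifact": "0.95",
    "t_lod_fstar_forward": "1.2",
    "power_to_detect_positive_strand_artifact": "0.40",
    "t_lod_fstar_reverse": "0.5",
}
print(fails_strand_bias(biased, strand_artifact_lod=2.0,
                        strand_artifact_power_threshold=0.9))
```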

4. Clustered Position

This filter rejects FPs caused by misalignments evidenced by the alternate alleles being clustered at a consistent distance from the start or end of the read alignment. Candidates are rejected if their median distance from the start/end of the read and the median absolute deviation of that distance are both lower than or equal to given thresholds. The position-from-end-of-read threshold value is controlled by --pir_median_threshold and the deviation value is controlled by --pir_mad_threshold.

In the CALLSTATS output file, the relevant columns are labeled tumor_alt_fpir_median and tumor_alt_fpir_mad for the forward strand, and complementary columns are labeled tumor_alt_rpir_median and tumor_alt_rpir_mad for the reverse (note the name difference is fpir vs. rpir, for forward vs. reverse position in read).
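
For readers unfamiliar with the median absolute deviation (MAD), the following sketch computes both statistics from a list of distances and applies the two thresholds. The input list, function names, and threshold values are illustrative assumptions; in practice you would read the precomputed median and MAD columns from the CALLSTATS file rather than recompute them.

```python
from statistics import median

def median_and_mad(distances):
    """Return the median and the median absolute deviation of the distances."""
    med = median(distances)
    mad = median([abs(d - med) for d in distances])
    return med, mad

def fails_clustered_position(distances, pir_median_threshold, pir_mad_threshold):
    """Reject when both the median and the MAD are at or below their thresholds."""
    med, mad = median_and_mad(distances)
    return med <= pir_median_threshold and mad <= pir_mad_threshold

# Alternate alleles piled up a few bases from the read end (illustrative).
clustered = [3, 4, 4, 5, 4]
# Alternate alleles spread along the read (illustrative).
spread = [5, 20, 35, 50, 65]

print(median_and_mad(clustered), fails_clustered_position(clustered, 10, 3))
print(median_and_mad(spread), fails_clustered_position(spread, 10, 3))
```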

5. Observed in Control

This filter rejects FPs in tumor data by looking at control data (typically from a matched normal) for evidence of the alternate allele that is above random sequencing error. Candidates are rejected if both the following conditions are met:

  • The number of observations of the alternate allele or the proportion of reads carrying the alternate allele is above a given threshold, controlled by --max_alt_alleles_in_normal_count and --max_alt_allele_in_normal_fraction.

  • The sum of quality scores is above a given threshold value, controlled by --max_alt_alleles_in_normal_qscore_sum.

In the CALLSTATS output file, the relevant columns are labeled n_alt_count, normal_f, and n_alt_sum.
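
The way the two conditions combine (count OR fraction, AND quality sum) can be sketched as below. Strict ">" comparisons are an assumption taken from the wording "above a given threshold", and the function name and values are hypothetical.

```python
def fails_observed_in_control(n_alt_count, normal_f, n_alt_sum,
                              max_alt_count, max_alt_fraction, max_qscore_sum):
    """Return True if the normal shows too much evidence for the alt allele."""
    # Condition 1: either the read count or the allele fraction is too high.
    too_many_alt_reads = (n_alt_count > max_alt_count
                          or normal_f > max_alt_fraction)
    # Condition 2: the summed base qualities of those reads are too high.
    too_much_quality = n_alt_sum > max_qscore_sum
    return too_many_alt_reads and too_much_quality

# Illustrative values only.
print(fails_observed_in_control(3, 0.02, 60, 2, 0.03, 20))   # both conditions met
print(fails_observed_in_control(3, 0.10, 10, 2, 0.03, 20))   # low quality sum
```

Because both conditions must hold, raising only the count threshold does not help if the candidate is still caught by the fraction test.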


Filters applied in all MuTect modes

1. Tumor and normal LOD scores

This filter rejects candidates with a tumor LOD score below a given threshold value, controlled by --tumor_lod, and similarly for a normal LOD score threshold controlled by --normal_lod_threshold.

In the CALLSTATS output file, the relevant columns are labeled t_lod_fstar and init_n_lod, respectively.

2. Possible contamination

This filter rejects candidates with potential cross-patient contamination, controlled by --fraction_contamination.

In the CALLSTATS output file, the relevant columns are labeled t_lod_fstar and contaminant_lod.

3. Normal LOD score and dbSNP status

If a candidate mutation is in dbSNP but is not in COSMIC, it may be a germline variant. In that case, the normal LOD threshold that the candidate must clear is raised to a value controlled by --dbsnp_normal_lod.

In the CALLSTATS output file, the relevant column is labeled init_n_lod.
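
The threshold selection described here can be sketched as follows. The function names are hypothetical, the threshold values in the example are illustrative rather than documented defaults, and the ">=" comparison is an assumption.

```python
def required_normal_lod(in_dbsnp, in_cosmic, normal_lod_threshold, dbsnp_normal_lod):
    """Pick the normal LOD threshold a candidate has to clear."""
    # Known dbSNP sites that are not also in COSMIC get the stricter threshold.
    if in_dbsnp and not in_cosmic:
        return dbsnp_normal_lod
    return normal_lod_threshold

def passes_normal_lod(init_n_lod, in_dbsnp, in_cosmic,
                      normal_lod_threshold, dbsnp_normal_lod):
    """Return True if init_n_lod clears the applicable threshold."""
    threshold = required_normal_lod(in_dbsnp, in_cosmic,
                                    normal_lod_threshold, dbsnp_normal_lod)
    return init_n_lod >= threshold

# Same normal LOD, three different annotation states (illustrative values).
print(passes_normal_lod(3.0, in_dbsnp=False, in_cosmic=False,
                        normal_lod_threshold=2.2, dbsnp_normal_lod=5.5))
print(passes_normal_lod(3.0, in_dbsnp=True, in_cosmic=False,
                        normal_lod_threshold=2.2, dbsnp_normal_lod=5.5))
print(passes_normal_lod(3.0, in_dbsnp=True, in_cosmic=True,
                        normal_lod_threshold=2.2, dbsnp_normal_lod=5.5))
```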

4. Triallelic Site Filter

When the program is evaluating a site, it considers all possible alternate alleles as mutation candidates, and puts them through all the filters detailed above. If more than one candidate allele passes all filters, resulting in a proposed triallelic site, the site is rejected with the reason triallelic_site because it is extremely unlikely that this would really happen in a tumor sample.
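
A small sketch of this last step, assuming each site is represented as a mapping from alternate allele to the list of failure reasons accumulated by the earlier filters (this data structure and the function name are hypothetical; only the triallelic_site label comes from the text above):

```python
def apply_triallelic_filter(candidates):
    """Reject every passing allele at a site if more than one allele passes.

    candidates maps each alternate allele to its list of failure reasons;
    an empty list means the allele passed all earlier filters.
    """
    passing = [alt for alt, reasons in candidates.items() if not reasons]
    if len(passing) <= 1:
        return {alt: list(reasons) for alt, reasons in candidates.items()}
    return {
        alt: (["triallelic_site"] if alt in passing else list(reasons))
        for alt, reasons in candidates.items()
    }

# Two alleles pass at the same site, so both are rejected (illustrative).
site = {"A": [], "T": [], "G": ["fstar_tumor_lod"]}
print(apply_triallelic_filter(site))
```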


Comments

  • Hi,

    I have a few questions about MuTect output and filtering:

    1) I was wondering if you can elaborate a bit more on how exactly the strand bias filter is implemented?

    You mentioned several values that are relevant to strand bias, namely:

    • power_to_detect_negative_strand_artifact
    • t_lod_fstar_forward
    • power_to_detect_positive_strand_artifact
    • t_lod_fstar_reverse
    But I don't see these columns in the output file. Are there any flags I need to specify to get them in the output file?

    2) Is HC detection mode enabled by default, or do I need to specifically enable it? Are the variants marked as KEEP or REJECT in the output file passed through just the standard detection or through the HC detection?

    3) Here you have mentioned several options to control filter parameters, e.g.:
    --gap_events_threshold
    --dbsnp_normal_lod
    --strand_artifact_lod

    Are they called at run time?

    They also don't seem to be documented elsewhere. Is there a full documentation on all the available parameters/arguments?

    Many thanks,

    Paul

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi Paul,

    1. Try adding --enable_extended_output to your command.

    2. MuTect runs in high-confidence mode by default assuming you are providing both a tumor and a normal.

    3. You can set their values in your command line, if that's what you mean. We don't currently have a proper document listing the arguments but they are fairly easy to read on this page of the code repository: MuTectArgumentCollection.java

  • Could you tell me how to filter out oxidation events as a result of shearing using MuTect as described in Costello et al 2012 (dx.doi.org/10.1093/nar/gks1443)? Thanks!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    I believe this is done using another program outside of MuTect; the software is available at http://broadinstitute.org/cancer/cga/dtoxog

  • Great, thank you very much!

  • xiaoxiaoh16, newyork (Member)
    edited December 2015

    Hi Geraldine,
    I have a few questions about MuTect output and filtering:
    Could you list the options that control the filter parameters? I would like to understand why my candidates are rejected.
    The failure_reasons are as follows:
    1) fstar_tumor_lod
    2) possible_contamination
    3) normal_lod
    4) alt_allele_in_normal
    5) poor_mapping_region_alternate_allele_mapq

    I couldn't find them in the document or in the link (MuTectArgumentCollection.java) you gave.
    I tried adding --enable_extended_output to my command, but it returned a lot of error output:
    java -Xmx4g -jar /public/apps/mutect/1.1.7/java.1.7.0_67/mutect-1.1.7.jar --analysis_type MuTect --enable_extended_output
    INFO 17:01:19,267 HelpFormatter - --------------------------------------------------------------------------------
    INFO 17:01:19,328 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-0-g72492bb, Compiled 2015/01/21 17:10:56
    INFO 17:01:19,328 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 17:01:19,329 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 17:01:19,335 HelpFormatter - Program Args: --analysis_type MuTect --enable_extended_output
    INFO 17:01:19,358 HelpFormatter - Executing as lzhang9@aecom-c42dfea.scm on Linux 2.6.32-504.1.3.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_67-b01.
    INFO 17:01:19,358 HelpFormatter - Date/Time: 2015/12/04 17:01:19
    INFO 17:01:19,359 HelpFormatter - --------------------------------------------------------------------------------
    INFO 17:01:19,359 HelpFormatter - --------------------------------------------------------------------------------
    INFO 17:01:20,750 GenomeAnalysisEngine - Strictness is SILENT

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.1-0-g72492bb):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Walker requires a reference but none was provided.
    ERROR ------------------------------------------------------------------------------------------

    thanks a lot,
    Cherry

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Sorry for the late reply. Your problem is that your command line is incomplete. There are several arguments missing.

  • Hello Geraldine,
    Sorry for the naive question but I am a little bit confused with the "--required_maximum_alt_allele_mapping_quality_score" parameter. Shouldn't it be the threshold for the minimum required mapping quality ensuring that there is "at least one observation of the mutant allele" with a decent mapping quality?.. Otherwise I can't make sense out of it.
    Thank you!
    Eugenie

  • And one more question concerning filter parameters. In Mutect-1.1.4 there was a "--clipping_bias_pvalue_threshold" argument (defined as "pvalue threshold for fishers exact test of clipping bias in mutant reads vs ref reads"), with a default value of 0.05. However, when using Mutect-1.1.7 I get "ERROR MESSAGE: Argument with name 'clipping_bias_pvalue_threshold' isn't defined".
    Was this argument discarded or renamed (I don't see anything similar in the Mutect-1.1.7 vcf header)?
    The reason I am interested is that I would like to disable it.
    So if the filter is discarded the problem is solved.
    I just want to make sure that the only argument related to soft/hard clipping left is "--heavily_clipped_read_fraction".
    (I have an example when all variant reads are soft clipped but I should probably post this issue separately)
    Thank you!
    Eugenie

  • RikyJKD, Italy (Member)
    Hi, I would like to run MuTect with the filter that removes germline mutations disabled. Could you tell me what values I have to set for the above parameters?
    Thanks.
    Riccardo
  • @RikyJKD
    Hi, wouldn't the raw output contain them anyway? If you don't use "only_passing_calls" argument they will be in your output file, marked as "REJECT" with a failure reason like germline_risk/normal_lod/alt_allele_in_normal etc.
    But I would also like to know what exactly are all these candidates.
    I would assume that initially every position with at least 1 variant read (quality filters applied?) is considered and then germline and artifacts are filtered. (Otherwise I can't explain where such a huge number of candidates comes from).
    @Geraldine, please can you comment on it?
    Thank you!
    Eugenie

  • RikyJKD, Italy (Member)

    Hi, thank you for the reply. I think that the best solution is to consider as KEEP all the REJECT calls that did not pass the alt_allele_in_normal filter, do you agree?
    Thank you.

    Riccardo.

  • RikyJKD, Italy (Member)

    Sorry, let me try to explain better. I would like to consider as KEEP the mutations that are also detected in the normal, because my normal is also a tumor. Do I have to consider all the REJECT calls that do not pass alt_allele_in_normal, or also germline_risk and normal_lod?
    Thank you.

    Riccardo

  • @RikyJKD said:
    Sorry, let me try to explain better. I would like to consider as KEEP the mutations that are also detected in the normal, because my normal is also a tumor. Do I have to consider all the REJECT calls that do not pass alt_allele_in_normal, or also germline_risk and normal_lod?

    Hi,
    I am not sure I understand it correctly: are you calling tumors without normal samples or?..
    I can't imagine why would you call two tumors together.
    And are you looking to retrieve real germline variants or somatic variants present in another tumor sample?

    I didn't test it myself but I would assume that by rescuing candidates marked with "alt_allele_in_normal" you should get most germline variants. However as you can imagine this will also include artifacts (eg coming from misalignments, where only a few variant reads are present in all samples). So you would probably like to filter them further.

    Btw, the alt_allele_in_normal filter is something you can change with "--max_alt_alleles_in_normal_count" argument (1 by default). Will it solve the problem maybe?

  • RikyJKD, Italy (Member)
    Thanks for the reply. I am using a drug-resistant tumor as the tumor and a tumor that is not resistant to the same drug as the control, so I am interested in knowing which mutations are also present in the non-resistant tumor.
    In your opinion, is my approach correct?
    Can I also consider germline_risk and normal_lod?
    Thank you.

    Riccardo
  • Eugenie (Member)
    edited September 2016

    @RikyJKD said:
    Thanks for the reply. I am using a drug-resistant tumor as the tumor and a tumor that is not resistant to the same drug as the control, so I am interested in knowing which mutations are also present in the non-resistant tumor.
    In your opinion, is my approach correct?
    Can I also consider germline_risk and normal_lod?

    Ah, I see, that's an interesting combination.
    But do you also have a normal sample for these tumors?
    I didn't have such projects myself but I would suggest you call tumors separately.
    Then you will have 2 sets of somatic mutations and you can do all kinds of filtering and intersections.
    I think filtering 2 sets of mutations (eg with GATK SelectVariants tool) is easier than trying to separate rejected candidates which represent a mix of true germline and somatic.
    That's what I was doing when dealing with more than one tumor per patient.
    Let me know if understood you correctly and made myself clear.
    Best,
    Eugenie

  • RikyJKD, Italy (Member)
    Thanks for the advice. Unfortunately I do not have the normal samples. In order to select the germline mutations that are interesting for me, do I have to consider all the REJECT calls that do not pass alt_allele_in_normal, or also germline_risk and normal_lod?
    Thank you.

    Riccardo
  • @RikyJKD said:
    Thanks for the advice. Unfortunately I do not have the normal samples. In order to select the germline mutations that are interesting for me, do I have to consider all the REJECT calls that do not pass alt_allele_in_normal, or also germline_risk and normal_lod?
    Thank you.

    Riccardo

    I see.
    By "germline" you mean present in another cancer sample?
    I would think alt_allele_in_normal is enough.
    But then I am not sure how are you going to separate SNVs from real germline: by allele frequency? (if the tumor content isn't high) filtering against SNPs from public datasets? (wouldn't recommend dbSNP as it has quite some somatic variants nowadays)

    Best,

    Eugenie

  • RikyJKD, Italy (Member)

    Yes, by germline in this case I mean the mutations present in the cancer sample used as normal. I would like to consider the allele frequency and dbSNP, and thank you for the information about the somatic variants it contains. Which databases do you advise instead of dbSNP in this case?
    Thanks.

    Riccardo

  • @RikyJKD said:
    Yes, by germline in this case I mean the mutations present in the cancer sample used as normal. I would like to consider the allele frequency and dbSNP, and thank you for the information about the somatic variants it contains. Which databases do you advise instead of dbSNP in this case?

    Well...1000 genomes is an obvious option but whether it will be helpful depends on the population you are working with.
    Do you have access to any normal samples?
    The best option would be to create a "panel of normals" using normal samples from the same population which were ideally processed in the same way (have a look at some related threads, eg http://gatkforums.broadinstitute.org/gatk/discussion/6904/panel-of-normals-for-mutect#latest).
    So even if you don't have matched normal samples maybe you have access to some normal samples sequenced in the same center - that would be of great help.

  • RikyJKD, Italy (Member)

    Thank you but I do not have any normal samples.
    Thanks.

    Riccardo

  • Hi, dear Geraldine,
    What are the recommended values of the --strand_artifact_lod and --strand_artifact_power_threshold parameters if I want to filter strand bias FPs? Should I use these two parameters at the same time?
