To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

What is the output of MuTect and how should I interpret it?

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
edited December 2015 in MuTect v1 Documentation

Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.

Overview

MuTect produces a lot of information that is spread across several different files. This document describes the most important outputs and how to interpret them. For a complete list of outputs and their description, please use the -help flag at the command line.

* Call-stats file

The main output which people typically work with is the "call-stats" file. It is an exhaustive report of all the metrics and statistics available about the calls made by MuTect and the filters that are applied internally by default. See further below for a more complete description of the call-stats output.

* VCF file of candidate mutations

Upon request, MuTect can output a summary VCF file containing the mutation candidates annotated with KEEP or REJECT in the FILTER field.

* Coverage / WIGGLE files

Also upon request, MuTect can output so-called "wiggle" files (in WIGGLE format) that contain useful information about the read coverage observed in the data. This format indicates for every base whether it is sufficiently covered in the tumor and normal to be sensitive enough to call mutations. We currently use cutoffs of at least 14 reads in the tumor and at least 8 in the normal (these cutoffs are applied after removing noisy reads in the preprocessing step). There are several different files that can be generated, containing e.g. overall coverage, just the tumor, just the normal, and so on.


More details about the call-stats file and how to use it

The call-stats output contains a lot of information that is intended to help with development, but that most users don't need to take into account in their analysis. Since this can be rather confusing, we recommend that you extract subsets of information from the call-states file according to your needs, rather than try to work with the whole thing.

Extracting subsets of data using grep

The most common subset you'll want to work with is the set of confident calls that were not rejected by MuTect's internal filters. An easy way to do this using basic Unix tools is to search for lines that don't contain the string REJECT:

grep -v REJECT <my.call_stats.txt>

You can also select subsets of sites that were filtered for specific reasons, in case you want to "rescue" those sites. This is the equivalent of disabling MuTect's internal filters, which is currently hard to do from command line.

Understanding the main statistics / fields

Here are the definitions of some of the most prominent outputs in the call-stats file:

  • contig: the contig location of this candidate
  • position: the 1-based position of this candidate on the given contig
  • ref_allele: the reference allele for this candidate
  • alt_allele: the mutant (alternate) allele for this candidate
  • tumor_name: name of the tumor as given on the command line, or extracted from the BAM
  • normal_name: name of the normal as given on the command line, or extracted from the BAM
  • score: for future development
  • dbsnp_site: is this a dbsnp site as defined by the dbsnp bitmask supplied to the caller
  • covered: was the site powered to detect a mutation (80% power for a 0.3 allelic fraction mutation)
  • power: tumor_power * normal_power
  • tumor_power: given the tumor sequencing depth, what is the power to detect a mutation at 0.3 allelic fraction
  • normal_power: given the normal sequencing depth, what power did we have to detect (and reject) this as a germline variant
  • total_pairs: total tumor and normal read depth which come from paired reads
  • improper_pairs: number of reads which have abnormal pairing (orientation and distance)
  • map_Q0_reads: total number of mapping quality zero reads in the tumor and normal at this locus
  • init_t_lod: deprecated
  • t_lod_fstar: CORE STATISTIC: Log of (likelihood tumor event is real / likelihood event is sequencing error )
  • tumor_f: allelic fraction of this candidated based on read counts
  • contaminant_fraction: estimate of contamination fraction used (supplied or defaulted)
  • contaminant_lod: log likelihood of ( event is contamination / event is sequencing error )
  • t_ref_count: count of reference alleles in tumor
  • t_alt_count: count of alternate alleles in tumor
  • t_ref_sum: sum of quality scores of reference alleles in tumor
  • t_alt_sum: sum of quality scores of alternate alleles in tumor
  • t_ins_count: count of insertion events at this locus in tumor
  • t_del_count: count of deletion events at this locus in tumor
  • normal_best_gt: most likely genotype in the normal
  • init_n_lod: log likelihood of ( normal being reference / normal being altered )
  • n_ref_count: count of reference alleles in normal
  • n_alt_count: count of alternate alleles in normal
  • n_ref_sum: sum of quality scores of reference alleles in normal
  • n_alt_sum: sum of quality scores of alternate alleles in normal
  • judgement: final judgement of site KEEP or REJECT (not enough evidence or artifact)
Post edited by Geraldine_VdAuwera on

Comments

  • Dear Geraldine,
    First of all, thank you so much for this documentation, it's really useful.

    Sorry if this is not the right place to ask but I have a question concerning coverage (WIGGLE) files. I am wondering which % of my exome regions are "usable"(callable) for Mutect in every tumor/nornal pair (meet the coverage requirement for both). Do I understand it correctly that calculating "grep -c "1" <my.coverage.wig.txt>" will actually give me the number of callable positions and "1"/("1"+"0") will be the ratio I am looking for?

    Thank you!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Totally the right place to ask. Sure, that's a valid way to get a general count/fraction of callable sites, assuming you used -L to restrict analysis to your exome intervals (otherwise the wiggle file covers the entire genome, not just the exome).

  • Dear Geraldine
    I am running MuTect with matched normal and tumor samples. I want to have the variant allele frequency of variants in the tumor sample.

    can you please tell me how can I get this information from vcf or from standard output?

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @tbiagini ,

    You can find information on how to do this in the GATK documentation.

  • Jie LiuJie Liu UW-SeattleMember

    Dear Geraldine,
    Thanks a lot for the documentation which helps a lot. I have a question regarding to running MuTect.

    I noticed that MuTect reports a list of loci --- some labeled with "Pass" and some with "Reject". But these are not all the loci on the genome. What happens to other loci which are not on the list?

    I'm interested in the read count supporting reference allele and the read count supporting alternative allele at one specific locus (which are not reported by MuTect). Since MuTect has a rigorous screening of the reads --- removing the reads with low score, etc... I also would like the read counts (supporting reference allele and the alternative allele) after the removal of these unreliable reads. Could you please suggest a way of doing that?

    Many, many thanks!

    JL

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @Jie Liu ,

    The loci that are not included are sites where MuTect found no evidence of variation at all. I think there is an option to emit non-variant sites (I'd have to look it up) but they will not be really informative.

    You can get allele depths in the optional VCF output of MuTect (the AD annotation) but AD is unfiltered. Only the total depth (DP) is shown after filtering.

  • yuzhyuzh ChinaMember

    Dear Geraldine,

    I use mutect-1.1.7.jar to call SNV. I cant understand some ouputs of mutect. Although I execute command "java -jar mutect-1.1.7.jar -T MuTect -h", I only find details of the parameters instead of the outputs. Whats meaning of the 't_q20_count'? And what`s the relationship between 't_q20_count' and 't_ref_count' and 't_alt_count'?

    Here are some ouputs of these field from my SNV data. I just copy three columns.
    No t_q20_count t_ref_count t_alt_count
    1 21 25 4
    2 11 6 2
    3 16 15 3
    4 24 20 4
    5 18 18 4

    I thought 't_q20_count' > ('t_ref_count' + 't_alt_count'), such as No.2, 11 > 6+2. But it`s not always like that, such as No.1. So I am puzzled with 't_q20_count'. Could you explaiin that to me?

    Thank you!

    YuZhang

  • david.knoffdavid.knoff Boston, MAMember

    Dear @Geraldine_VdAuwera

    Thank you for providing all of this documentation along with the helpful questions and answers.

    Regarding the output VCF file from MuTect, does this file require joint genotyping/is this possible? Or should I continue on to the variant filtering step with this output VCF file?

    Thanks!
    David

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Joint genotyping is not possible with the current version of MuTect. Note also that the filtering methods recommended for germline analysis are not applicable for somatic variants. MuTect itself applies filtering internally. We do not offer any additional guidance for analysis of the somatic variants after the MuTect step. See Phase 2B materials in https://www.broadinstitute.org/gatk/guide/presentations?id=6007 for current recommendations on somatic variant discovery.

  • EADGEADG KielMember

    Hi,

    maybe the value 't_q20_count' is also interesting for others. Tracing it to GitHub (Mutect) i got following Code:

    int tumorQ20BaseCount = tumorReadPile.getFilteredBaseCount(20);    
    
    public int getFilteredBaseCount(int minBaseQualityScore) {
        return this.finalPileup.getBaseFilteredPileup(minBaseQualityScore).depthOfCoverage();
    }
    

    Which means that t_q20_count are all reads above a quality score of 20.

    Greetings

    EADG

  • GumiltonGumilton United StatesMember

    Dear @Geraldine_VdAuwera ,

    I have finished muTect on a large set of samples. Each individual give me a SampleXXX.out file with a coverage.wig.txt.
    I want to merge all the results file (*.out) into one vcf file. Could you please let me know I can achieve that? Thanks!

    I used bcftools merge but failed.
    ]$ ~/tools/bcftools/bcftools merge *out > merged_GATK_MuTect.vcf
    Failed to open HCC_12.out: unknown file type

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    The *.out files output by MuTect are not VCF files and cannot be handled by bcftools. You can have MuTect output VCF files as an option.

  • GumiltonGumilton United StatesMember

    @Geraldine_VdAuwera , Thank you!
    I have run MuTect in a lot of samples... it takes one week to finish. Is there any way I can convert the *.out files to .vcf format?
    Otherwise I may have to run MuTect again on so many samples...

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @Gumilton I'm not aware of any straightforward way to do so, but there may be conversion tools out there that I do not know of.

    Please don't take this personally, but this is a good example of why it's useful to plan your analysis from start to finish and test your pipeline on a subset of the data before running on the full dataset.

  • XBonXBon SwitzerlandMember

    Hi,

    Mutect 1.1.7 outputs additional fields in the *.out file when compared to Mutect 1.1.4.
    Most of them are easy to figure out but not all of them are straightforward. I haven't been able to find documentation regarding these fields, they are not in the --help either. Is there a document for Mutect 1.1.7's output?

    I'm sorry if this has been discussed before. Thank you in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @XBon, the rule of thumb is that the outputs that are not documented are for the most part things that the developers put in to test if they might be useful, but turned out not to be really informative -- so my recommendation is to simply ignore them.

  • XBonXBon SwitzerlandMember
  • Hi @Geraldine_VdAuwera, I found one unexpected thing in the MuTect result of vcf format, in which the order of tumor and normal sample was not always consistent. Tumor data information is in the left column in some pairs, and in the right column in other pairs. It's not a big problem, but still a little strange.

    EX.
    FORMAT Germline Tumor
    GT:AD:BQ:DP:FA:SS 0:10,0:.:10:0.00:0 0/1:16,4:25:20:0.200:2

    FORMAT Tumor Germline
    GT:AD:BQ:DP:FA:SS 0/1:8,3:29:11:0.273:2 0:33,0:.:15:0.00:0

    Does anyone see this?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    This is because the columns are ordered alphabetically by sample name, if I recall correctly.

  • Dear Geraldine,

    Please, a question about how to interpret the read counts in the output generated by Mutect.

    In my output files i have several variants of several genes where the total_reads is a number that does not match with the sum of the columns "t_ref_count" and "t_alt_count" and the difference is quite important.

    Here i paste some examples of one gene

    contig position context ref_allele alt_allele total_reads t_ref_count t_alt_count
    chr17 45214544 GACxCTG T C 145 33 4
    chr17 45214558 TTCxAAA G A 189 36 8
    chr17 45214559 TCGxAAA A G 188 44 0
    chr17 45214564 AACxAGC A T 198 38 9
    chr17 45214582 CCAxTTC A G 192 48 5
    chr17 45214592 AGTxAAG T G 185 49 4
    chr17 45214600 CAAxCTC A G 161 48 0
    chr17 45214604 CTCxTGC A T 154 39 7
    chr17 45214605 TCAxGCC T C 153 38 8
    chr17 45214606 CATxCCC G T 153 39 7
    chr17 45214614 AATxGAG A G 45 35 0

    Am i missing something Geraldine or is it perhaps that mutect only writes in the columns "t_ref_count" and "t_alt_count" those reads that pass several basal filters related with the appropriate paired or others?

    Thank you very much in advance

    carlx

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Yes, MuTect applies some quality filters that may remove some reads from the read counts.

  • ok Geraldine,thank you very much for your reply, that is what i was thinking, indeed.
    There is a way to disable these filters by my own? i imagine that it is not recommendable to do it, and i won't, but i was just wondering just for curiosity if it is possible. That might help a bit when looking for very low frequency variants.
    carlx

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    For the most part it's not possible, and as you surmised correctly, it wouldn't be advisable anyway.

  • morovatuncmorovatunc turkeyMember

    @Geraldine_VdAuwera said:
    The *.out files output by MuTect are not VCF files and cannot be handled by bcftools. You can have MuTect output VCF files as an option.

    Dear Geraldine, I did not know about the out --vcf part option, therefore I only had an stat output I believe. Is there a way to transform it to vcf without running it again ? :(

    Thank you in advance.

    Tunc.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    We don't provide any tools to do that conversion but there may be conversion utilities out there.

  • daveliudaveliu Cambridge, MAMember, Broadie

    Hi, I'm trying to determine the definition of "covered base" -- I see here that the default is 8 reads in the normal and 14 reads in the tumor. Is there a minReadQuality filter for a valid read when counting the number of bases?

    Or does the notion of a "covered base" incorporate "tumor power" and "normal power" (0.8 power to detect a 0.3 AF variant)?

    Apologies if this is not in the right place in the forum...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @daveliu, "covered base" just accounts for the depth of coverage at a site. There are two main filters applied before this is evaluated; mapping quality of the read and base quality of the individual base calls. I don't have access to the actual values right now but this is something we can look up if needed.

  • Dear Geraldine,
    Could you please explain which type of information contain the following mutect output column

    observed_in_normals_count :

    thank you in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    That's the number of reads supporting an alt allele were observed in the normal sample.

  • hi Geraldine,
    I searched around and still couldn't see criteria for which MuTect labels variants as reject. How would I know why a variant was rejected? The information about MuTect filters and thresholds are scattered in multiple posts. I wish everything could be in one detailed document. thanks.

  • @afadda said:
    hi Geraldine,
    I searched around and still couldn't see criteria for which MuTect labels variants as reject. How would I know why a variant was rejected? The information about MuTect filters and thresholds are scattered in multiple posts. I wish everything could be in one detailed document. thanks.

    Hi @afadda,
    Are you using the "--enable_extended_output option"? (see http://gatkforums.broadinstitute.org/dsde/discussion/3483/why-this-mutation-is-rejected ; It should work for Mutect-1.1.4 and Mutect-1.1.7 I have been using).
    If yes your CALLSTATS output file includes "failure_reasons judgement" column which is what you are looking for.
    You would probably also like to look at this post on Mutect filters: http://gatkforums.broadinstitute.org/gatk/discussion/4464/how-mutect-filters-candidate-mutations

    Concerning thresholds and default values - I am also dreaming about explicit documentation.
    But before this dream comes true you can just check the .vcf header which contains all argument values.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    We've been waiting on some improvements to MuTect2 that were backlogged for a long time pending key feedback from analysts who were testing it. Now this work has been reprioritized and the tool is back in active development. Once the work is done and the tool is considered more mature, we will put effort into documenting it fully. I expect this will happen sometime around December 2016. I know it's a long time to wait but it's all we can do with current resources -- bear with us and maybe we can make your dreams come true eventually :)

    Note however that at that time, the original MuTect (which is the topic of this thread) will be officially retired.

  • Hi Geraldine,
    Thanks for the reply and sorry for our moaning;) GATK documentation is already amazing and I greatly appreciate the efforts you (and the whole team) put in it.

    So Mutect1 will be officially deprecated once a stable Mutect2 version is released?
    I am looking forward to testing it but I will probably stick to Mutect1 in the meanwhile as I am a little bit concerned about the higher false positive rate reported for Mutect2 beta.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hah, no problem, we welcome all constructive feedback :)

    Yes, that's the intention -- once MuTect2 is considered stable enough to be adopted into our production pipeline, MuTect1 will be officially deprecated. If I had to guesstimate a timeline, I would put that deprecation on a six month horizon.

    Unofficially, I can share that the cancer researchers we interact with report that they use MuTect1 for SNVs and MuTect2 for indels in their work.

  • @Geraldine_VdAuwera
    Thank you for your patience;)

    MuTect1 for SNVs and MuTect2 for indels

    Hm, that makes sense, thanks for sharing)

  • JeremyAdamsJeremyAdams OICRMember

    Hi Geraldine, thanks for this documentation.

    I have a question about 3 stats in the call-stats file: init_n_lod, init_t_lod, and t_lod_fstar. You mention that init_t_lod is deprecated, does that mean init_n_lod is deprecated as well? Or is init_n_lod the normal allele equivalent of t_lod_fstar?

    I'm also wondering how these stats compare to the TLOD and NLOD scores in mutect2. I'm assuming that mutect1's t_lod_fstar is the equivalent of mutect2's TLOD, and that mutect1's init_n_lod is the equivalent of mutect2's NLOD. Am I right to assume this?

    Please let me know how these stats compare between mutect1 and 2, as we use both callers in our pipeline.

    Thank you!

    Issue · Github
    by shlee

    Issue Number
    2312
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie, Moderator

    Just to be clear @JeremyAdams, which version of M2 are you referring to, GATK3 or GATK4?

  • JeremyAdamsJeremyAdams OICRMember

    GATK 3.6.0 for Mutect2

  • shleeshlee CambridgeMember, Broadie, Moderator

    I will ask Geraldine to get back to you @JeremyAdams.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @JeremyAdams, the code that calculates the various LOD scores in MuTect2 in GATK 3.x was lifted pretty much wholesale from Mutect1, so there shouldn't be any major differences. However in the GATK4 version which is much improved, there have been a lot of changes. We're not going to track specifically what changed to what (because pragmatically that's more effort than we can spare), and instead we'll focus on providing very detailed docs about how things are done in GATK4 Mutect2. Expect those docs to roll out over the next month or so, in time for the 4.0 release.

Sign In or Register to comment.