What does MuTect output mean?

nikkinath (Germany) Member
edited January 2016 in MuTect v1

Hello!

Currently, I am working on identifying somatic mutations in exome data. For this, I am using GATK with MuTect2 and the following command:

java -jar /GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar \
    -T MuTect2 \
    -R /ucsc/ucsc.hg19.fasta \
    --dbsnp /broadinstitute/dbsnp_138.hg19.excluding_sites_after_129.vcf \
    -I:normal GATK_haplotype_unifiedgenotype/output_bowtieoutput.bamaddreplaceread.bam \
    -I:tumor /GATK_haplotype_unifiedgenotype/output_bowtieoutput_bamaddreplaceread.bam \
    -o outputmutect.txt

I am not using <cosmic.vcf> as this is not cancer data. So my questions are:

  1. Is this command correct for identifying somatic mutations in a non-cancer dataset? If not, please direct me to the relevant paper or discussion.
  2. How do I interpret the output, especially the FILTER column, e.g. alt_allele_in_normal;clustered_events, alt_allele_in_normal;clustered_events;t_lod_fstar or alt_allele_in_normal;str_contraction? There are other values as well, but I cannot find documentation explaining what they mean. Please let me know if there is a better way to identify somatic mutations for a rare disease.

Thanks!

Answers

  • nikkinath (Germany) Member

    I found the answer to question 2 after rerunning the command. I would still like an answer to question 1: should I use cosmic.vcf even if I am not working on cancer data?

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    We don't yet have complete documentation for MuTect2 because it is still in beta status. For others who may be interested, the filtering criteria are the following:

    • alt_allele_in_normal: the alternate allele was also found in the normal so this is probably either a shared artifact or a germline event
    • clustered_events: several mutations that are close together (parameters are in the tool doc), which is often a sign of an artifact
    • t_lod_fstar: log-odds of an event in the tumor, which is a measure of confidence that the mutation is real
    • str_contraction: related to short tandem repeats

    The COSMIC file is used to whitelist known mutations reported in COSMIC. With the caveat that I don't know what you're working on and therefore can't judge specifically if this advice is appropriate to you -- I would recommend using it since it shouldn't hurt, and might help rescue mutations of interest that may have been reported in the cancer context but aren't necessarily only associated with cancer -- if that is a possible eventuality.
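
To see how often each of the filter names listed above appears in a MuTect2 output file, one can tally the semicolon-separated entries of the FILTER column (the seventh column of a VCF). The following is a minimal sketch and not part of MuTect2: the helper name `tally_filters` and the example records are made up for illustration.

```python
from collections import Counter

def tally_filters(vcf_lines):
    """Count how often each FILTER entry appears in VCF data lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        for name in fields[6].split(";"):
            counts[name] += 1
    return counts

# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\talt_allele_in_normal;clustered_events\t.",
    "chr1\t250\t.\tG\tC\t.\tPASS\t.",
]
filter_counts = tally_filters(example)
print(filter_counts)
```

With a real file, the same function can be fed an open file handle, for example `tally_filters(open("outputmutect.txt"))`, assuming the output is an uncompressed VCF.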

  • nikkinath (Germany) Member

    Thanks for your response. I will briefly describe the pipeline I am working on. I cannot say much about the data because I don't have permission to do so. I agree on the COSMIC point and will include that file as well.

    My responsibility in this project is to design a pipeline to identify somatic mutations; we have good coverage of the exome sequence of particular genes. The pipeline consists of the following steps: trimming (Trimmomatic) -> alignment (Bowtie2) -> AddOrReplaceReadGroups -> HaplotypeCaller + UnifiedGenotyper. I did not remove duplicates because we use an amplicon sequencing method. To identify somatic mutations, I am now using MuTect2. Once I have the MuTect2 output, which flag indicates a somatic mutation? Or does the entire output consist of somatic mutations, meaning that we need to decide on the filtering ourselves?

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    MuTect2 will only output mutations that are thought to be somatic.

    You may indeed need to perform additional filtering. Output calls are candidate mutations but we cannot guarantee that they are all real somatic mutations.
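
One simple form of additional filtering is to keep only the records that passed all of MuTect2's own filters, i.e. those whose FILTER column is PASS. The sketch below assumes an uncompressed, tab-separated VCF; `passing_records` is a hypothetical helper and the records are made up. Project-specific thresholds would still have to be chosen on top of this.

```python
def passing_records(vcf_lines):
    """Yield header lines and data lines whose FILTER column is PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep the header so the result is still a VCF
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the seventh column (index 6) of a VCF data line.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line

# Made-up records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\talt_allele_in_normal;str_contraction\t.",
    "chr1\t250\t.\tG\tC\t.\tPASS\t.",
]
kept = list(passing_records(example_vcf))
for record in kept:
    print(record)
```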

  • nikkinath (Germany) Member

    Thanks! I will plan my strategy accordingly.

  • ltnetcase (Tianjin, China) Member

    Dear @Geraldine_VdAuwera, I am confused about the 'clustered_events' filter. In the code:

    https://github.com/broadgsa/gatk-protected/blob/master/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/cancer/m2/MuTect2.java#L806

    it appears to filter variants that have 2 mutation events whose start positions are more than 2 bp apart, rather than mutations that are close to each other. Am I misunderstanding something, or is this a bug?

    Thanks in advance.

    [GitHub issue 928, opened by Sheila; state: closed; closed by vdauwera]

  • Sheila (Broad Institute) Member, Broadie, Moderator admin

    @ltnetcase
    Hi,

    Sorry for the delay. I am checking with the team and will get back to you.

    -Sheila

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin
    edited June 2016

    @ltnetcase Apologies for the very late response. Your reading of the code is correct: the clustered_events filter checks that the events are beyond a certain minimum distance from each other. What we're looking for is the presence of events that occur close to each other, i.e. within the bounds of the variant context (here I believe that would be the active region, so typically a few tens of bases, possibly a couple hundred), yet are not adjacent, because we want to be able to capture complex somatic substitution events.
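
The rule described above can be sketched in a few lines. This is not the MuTect2 source: the function name `looks_clustered` is hypothetical, and the default threshold of 2 bp is only taken from the reading of the code earlier in this thread. The idea is that a region is flagged when it holds at least two events whose start positions are farther apart than the threshold, so adjacent changes (a possible complex substitution) are not flagged.

```python
def looks_clustered(event_starts, min_distance=2):
    """Return True if a region holds several events that are not adjacent.

    event_starts: start positions of the candidate events in one region.
    min_distance: assumed threshold; events whose starts are farther apart
                  than this are treated as separate, non-adjacent events.
    """
    if len(event_starts) < 2:
        return False  # a single event cannot form a cluster
    span = max(event_starts) - min(event_starts)
    return span > min_distance

single = looks_clustered([1000])          # one event only
adjacent = looks_clustered([1000, 1001])  # adjacent, e.g. one complex substitution
spread = looks_clustered([1000, 1040])    # same region, but not adjacent
print(single, adjacent, spread)
```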

  • artitandon Member

    This may be a stupid question, but I want to confirm: in MuTect/MuTect2 output, is the REF allele the allele in the normal sample and the ALT allele the allele in the tumor sample? What happens when the allele in the normal sample differs from the reference genome? For example, hg19 has A at a position, the normal sample has C and the tumor sample has T. How will this be reflected in the MuTect VCF output?

  • Sheila (Broad Institute) Member, Broadie, Moderator admin

    @artitandon
    Hi,

    The REF allele is always the reference allele at that position. The alternate alleles reflect any allele that is not the reference allele. So, in your example, the REF column will have A and the ALT column will have C,T. The genotypes for both samples will reflect which allele is present.

    I hope that helps.

    -Sheila
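
To make the REF/ALT convention above concrete: in a VCF genotype (GT) field, index 0 refers to the REF allele and indices 1, 2, ... refer to the comma-separated ALT alleles in order. The sketch below decodes a GT string for the REF=A, ALT=C,T example; `genotype_alleles` is a hypothetical helper, and the genotype values are made up, since the thread does not state the actual genotypes.

```python
def genotype_alleles(ref, alt, gt):
    """Translate a VCF GT string such as '0/1' into allele bases."""
    alleles = [ref] + alt.split(",")  # index 0 is REF, 1.. are ALT alleles
    separator = "|" if "|" in gt else "/"  # phased or unphased genotype
    # A "." index means the allele call is missing.
    return [alleles[int(i)] if i != "." else "." for i in gt.split(separator)]

# REF=A, ALT=C,T as in the example above; the GT values are assumptions.
normal = genotype_alleles("A", "C,T", "1/1")  # normal sample carrying C
tumor = genotype_alleles("A", "C,T", "1/2")   # tumor sample carrying C and T
print(normal, tumor)
```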

  • Hi,

    Sorry to comment on an old thread, but it is very relevant for me. Is there a way to turn off the clustered_events filter? I'm looking for a signature of specific base changes that occur randomly across the genome, so it would be interesting to see whether the results differ with the filter turned off.

    And what exactly are the parameters? I can't find anything in the tool doc that answers this; in fact, I can't find anything about this filter in the tool doc at all.

    It has also been stated that this filter is applied because clustering of variants signifies a technical artefact. Can you point me towards any sources for that? HaplotypeCaller can be set to "RNAseq mode", which among other things filters out clusters of 3 or more variants within a 30 bp window. Are the filters related, either in principle or in the code? I can't find any sources for why this is done for RNAseq data either, so, slightly off topic, if anyone can explain that or point me towards some sources, I'd be grateful.

    Best Wishes

    Jack

  • @Bheat2615 said:
    Sorry, I thought this thread related to Mutect2.

  • Sheila (Broad Institute) Member, Broadie, Moderator admin

    @Bheat2615
    Hi Jack,

    In GATK3, there is no way to turn the filter off. In GATK4, however, you can use FilterMutectCalls with --maxEventsInHaplotype; I think setting that to a larger number will help you.

    -Sheila
