What input files does MuTect accept / require?

Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
edited December 2015 in MuTect v1 Documentation

Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.

Analyses done with MuTect typically involve several (though not necessarily all) of the following inputs:

  • Reference genome sequence
  • Sequencing reads for normal tissue and tumor tissue (normal/tumor data)
  • Intervals of interest
  • COSMIC data
  • Panel of normals

Since MuTect is based on GATK, the general format requirements are the same as those described in the GATK documentation on input files.

Below are the input requirements and/or recommendations that are specific to MuTect.

1. Normal/Tumor data

A key component of the MuTect method involves comparing evidence for variation in a tumor sample against a matched normal sample from the same individual, in order to distinguish somatic mutations from germline mutations. The Best Practice recommendation is therefore to provide MuTect with both normal and tumor data from the same individual. However, it is possible to run MuTect on tumor samples alone, without a matched normal. If available, a Panel of Normals (PoN) can be used to represent expected germline variation.

2. COSMIC data

COSMIC stands for Catalogue Of Somatic Mutations In Cancer. It is a database of variants that have been found to be implicated in cancer processes, maintained by the Sanger Institute (see project website).

MuTect uses the COSMIC data to whitelist variants that are found in tumor samples, to prevent them from being filtered out if they are also present in dbSNP or a panel of normals.


Comments

  • Where can I download panel of normals data to use as MuTect input if we don't have normal tissue as a control? Thanks!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    We don't currently provide a PON resource; you would need to generate your own as described in the MuTect paper.

  • shair_rosenberg, L'Institut du Cerveau et de la Moelle Épinière (ICM), Salpetriere Hospital, Paris, France (Member)

    Hi,

    I tried to solve the COSMIC file problem (my BAM files are aligned to hg19 and the COSMIC file does not use the same chromosome naming) without success, despite reading the forum posts. I therefore tried to run MuTect without a COSMIC file. Do I lose important information? Will using Oncotator afterwards help?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Using the COSMIC file allows you to "rescue" known somatic variants that might otherwise be filtered out. Using Oncotator afterward does not replace the COSMIC functionality. I strongly recommend you find a way to use the COSMIC files (either find an hg19 version or realign your data to b37) in order to take advantage of this functionality.
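
One possible route to an hg19-named COSMIC file, offered only as a minimal sketch and not as an official conversion tool: the primary chromosomes (1-22, X, Y) share the GRCh37 coordinate system in b37 and UCSC hg19, so for those contigs a VCF can be adapted by adding the `chr` prefix. The mitochondrial contig is a different sequence in the two references, so this sketch drops it together with every other non-primary contig. The names `rename_vcf_line` and `convert_vcf` are hypothetical, and the output may still need to be sorted to match the contig order of your reference.

```python
import re

# Contigs assumed to share coordinates between b37 and UCSC hg19.
PRIMARY_CONTIGS = {str(n) for n in range(1, 23)} | {"X", "Y"}

# Matches a VCF header line such as ##contig=<ID=1,length=249250621>
CONTIG_HEADER = re.compile(r"^(##contig=<ID=)([^,>]+)(.*)$", re.DOTALL)


def rename_vcf_line(line):
    """Return the line with an hg19-style contig name, or None to drop it."""
    header = CONTIG_HEADER.match(line)
    if header:
        prefix, name, rest = header.groups()
        return f"{prefix}chr{name}{rest}" if name in PRIMARY_CONTIGS else None
    if line.startswith("#"):
        return line  # other header lines pass through unchanged
    chrom, tab, rest = line.partition("\t")
    if chrom not in PRIMARY_CONTIGS:
        return None  # e.g. MT or unplaced contigs are dropped
    return f"chr{chrom}{tab}{rest}"


def convert_vcf(src_path, dst_path):
    """Write a chr-prefixed copy of a b37 VCF (hypothetical helper)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            renamed = rename_vcf_line(line)
            if renamed is not None:
                dst.write(renamed)
```

A call such as `convert_vcf("b37_cosmic_v54_120711.vcf", "hg19_cosmic.vcf")` would then produce a file to pass to `--cosmic`; the output file name here is only an example.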

  • shair_rosenberg, L'Institut du Cerveau et de la Moelle Épinière (ICM), Salpetriere Hospital, Paris, France (Member)

    Thanks for your answer. I ran liftOverVariants and FilterLiftedVariants on the COSMIC file supplied on the MuTect site. However, when I ran MuTect (14 samples) there was no change: the same number of mutations (around 100-200 per sample).

    • Is this possible (for example, because the number of dbSNP somatic mutations is not that large), or might the COSMIC part not have worked well?
    • I saw that the COSMIC file supplied on the MuTect site has ~33,000 mutations. Should there be more than that?
    • Chromosomes X/Y were not transferred by the liftover, so I entered them manually (with the prefix chr) and then used FilterLiftedVariants, and it worked well. Is this OK?
    • The MuTect site provides dbsnp_132_b37.leftAligned.vcf. I took a more recent version of dbSNP, dbsnp_138.hg19.vcf, but its name does not mention leftAligned. Is that a problem?
      Thanks!
  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @shair_rosenberg I'm sorry but I'm not sure I understand what you're asking.

    If you want to know if the lift over process worked, you can check that the resulting file contains the same number of variants and that the information looks correct. It certainly sounds like everything went well. When the program encounters problems it typically complains very loudly!

    If you're trying to figure out whether your variant calling metrics are ok, that's a much bigger question, which I can't help you with because it varies a lot according to experimental design. I recommend looking at the relevant literature to compare with what others typically observe in similar cases.
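
The check suggested in this reply (that the lifted-over file contains the same number of variants as the original) can be sketched as follows. This is an illustrative helper, not part of GATK or MuTect; the function names are hypothetical, and it assumes uncompressed VCF text in which every non-header, non-blank line is one record.

```python
def count_vcf_records(lines):
    """Count data records, skipping header lines (starting with '#') and blank lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_vcf_file(path):
    """Count the records in an uncompressed VCF file on disk."""
    with open(path) as handle:
        return count_vcf_records(handle)
```

Comparing `count_vcf_file` on the file before and after the liftover shows whether any records were dropped along the way, as happened with the X/Y records mentioned above.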

  • zuoxy, China (Member)

    Hi,
    I got an error when running on Normal/Tumor bam file:

    ERROR MESSAGE: java.lang.IllegalArgumentException: Comparison method violates its general contract!

    below is my command:
    java -jar muTect-1.1.4.jar --analysis_type MuTect --intervals regions.intervals --reference_sequence ucsc_hg19.fa --cosmic b37_cosmic_v54_120711.vcf --dbsnp dbsnp_138.hg19.vcf --input_file:normal normal.bam --input_file:tumor tumor.bam --out call_stat.out --coverage_file coverage.wig.txt

    I wonder whether something is wrong with the BAM files. I used bwa mem to do the alignment, with the RG setting "@RG\tID:ID171\tLB:172N\tPL:ILLUMINA\tSM:172N" for normal and "@RG\tID:ID172\tLB:172T\tPL:ILLUMINA\tSM:172T" for tumor. Then samtools sort and rmdup were used for sorting and deduplication.

    Am I missing some points? Thanks a lot!!

    Hiu

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    This sounds like a known bug in version 1.1.4; try running with the most recent version (available in the GATK downloads page) and let me know if the error persists.

  • zuoxy, China (Member)

    @Geraldine_VdAuwera said:
    This sounds like a known bug in version 1.1.4; try running with the most recent version (available in the GATK downloads page) and let me know if the error persists.

    mutect-1.1.7 seems to work correctly! Many thanks! I have two more questions: 1) Does MuTect support multithreading, and how can I reduce the total runtime, since it requires half a month for WES? 2) Can BQSR and indel realignment be performed before MuTect to reduce potential false positives? Or does MuTect already account for these issues?

    Thanks a lot!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    MuTect does not support multithreading; you can parallelize it by exome target interval.

    As noted in the MuTect publication, it is indeed recommended to perform the GATK pre-processing phase of the Best Practices (indel realignment and BQSR). Specifically, you should do co-cleaning of the tumor and normal together if you have T/N pairs. The paper goes over this in more detail.
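
The suggestion to parallelize by exome target interval can be sketched as below: split the target list into contiguous shards of roughly equal size in bases, write one interval file per shard, and run one MuTect job per file via `--intervals`. This is only a sketch under stated assumptions: every interval is written as `contig:start-end`, and the function names and the output file naming scheme are hypothetical. How the per-shard outputs are combined afterwards is left to the user.

```python
def parse_interval(text):
    """Split 'contig:start-end' into (contig, start, end)."""
    contig, _, span = text.strip().rpartition(":")
    start, _, end = span.partition("-")
    return contig, int(start), int(end)


def shard_intervals(intervals, n_shards):
    """Group intervals, in their original order, into at most n_shards
    contiguous shards covering roughly equal numbers of bases."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    sizes = []
    for interval in intervals:
        _, start, end = parse_interval(interval)
        sizes.append(end - start + 1)
    target = sum(sizes) / n_shards  # bases per shard if split evenly
    shards, current, covered = [], [], 0
    for interval, size in zip(intervals, sizes):
        current.append(interval)
        covered += size
        # Close the shard once cumulative coverage reaches the next cut point,
        # keeping the last shard open so it absorbs the remainder.
        if covered >= target * (len(shards) + 1) and len(shards) < n_shards - 1:
            shards.append(current)
            current = []
    if current:
        shards.append(current)
    return shards


def write_shards(shards, prefix):
    """Write each shard to '<prefix>.<index>.intervals' and return the paths."""
    paths = []
    for index, shard in enumerate(shards):
        path = f"{prefix}.{index:03d}.intervals"
        with open(path, "w") as handle:
            handle.write("\n".join(shard) + "\n")
        paths.append(path)
    return paths
```

Keeping the shards contiguous preserves the original ordering of the target list, at the cost of a less even split than a size-sorted assignment would give.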

  • barbarian, Japan (Member)

    What type of sequencing data is suitable as MuTect input: RNA-seq or exome-seq?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    MuTect has not been tested on RNAseq, as far as I know; but it works well on exome sequence.

  • newGATKuser, Case (Member)

    Hi GATK team,

    I am interested in using MuTect to call mutations from TCGA whole genome sequencing files.
    On the help page for MuTect at https://www.broadinstitute.org/cancer/cga/mutect_run, the inputs include both the tumor and normal bam files. I am able to run MuTect with these input arguments.

    I am trying next to add MuTect to my pipeline using Queue. I found this .scala script at GitHub (https://github.com/broadgsa/gatk/blob/master/public/gatk-queue-extensions-public/src/main/scala/org/broadinstitute/gatk/queue/extensions/cancer/MuTect.scala). Sorry, this may be a basic question, but I am having trouble understanding what inputs to pass to this script. Specifically, since MuTect extends CommandLineGATK, and CommandLineGATK seems to only accept one input file, is there a way to pass to the script both the tumor and normal bam files?

    Thanks a lot for your help,
    Steve

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi Steve,

    You shouldn't call this script directly. When you add a MuTect step to your Queue script, the program will call this internally. Have you developed Queue scripts previously?

  • newGATKuser, Case (Member)

    Hi Geraldine,

    Thanks a lot for the reply. I am very new to Scala and Queue scripts. Reading the article https://www.broadinstitute.org/gatk/guide/topic?name=queue, I thought developing the script myself would be too advanced for me. Instead I was able to adapt .scala scripts I found from GitHub such as https://github.com/CuppenResearch/GATK-QScripts/blob/master/IndelRealigner.scala
    to run IndelRealigner with the Queue. This .scala script listed the inputs I should pass it, so I used this command:

    java -Djava.io.tmpdir=tmp -jar ~/jobs/scripts/Queue-3.4-0/Queue.jar -startFromScratch -S ~/jobs/scripts/Queue-3.4-0/resources/IndelRealigner.scala -R ~/jobs/REF_GENOME/hg19.fa -I $SN$_markDup.bam -I $ST$_markDup.bam -mem 32 -nt 16 -nsc 4 -mode "multi" -known ~/jobs/scripts/1000G_phase1.indels.hg19.sites.vcf -known ~/jobs/scripts/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -jobRunner PbsEngine

    and the job seemed to run successfully. I thought the steps to add MuTect to the pipeline would be similar (i.e., call the .scala script for MuTect, passing it the necessary inputs and other arguments), but as you can see I am running into trouble. I've also tried adapting this .scala script (https://github.com/broadinstitute/mutect/blob/master/MuTectPipeline.scala), which is more similar to the IndelRealigner .scala script I had success with, but calling this .scala script:

    java -Djava.io.tmpdir=tmp -jar ~/jobs/scripts/Queue-3.4-0/Queue.jar -startFromScratch -S ~/jobs/scripts/Queue-3.4-0/resources/MuTect.scala -R ~/jobs/REF_GENOME/hg19.fa -tb $ST$.realigned_recalibrated.bam -nb $SN$.realigned_recalibrated.bam -nsc 4 -o $ST$_call_stats.out -jobRunner PbsEngine

    got me this error:
    ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: MuTect

    Could you explain more how to add MuTect to Queue? I appreciate very much any suggestions as I know getting MuTect to run with Queue will save many hours in the analysis.

    Sincerely,
    Steve

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    OK, so you have two problems.

    One of your problems is that some of the scripts you found (the first two in your latest post) are indeed pipelining scripts, but the MuTect script you referred to originally is not a pipelining script; it's the GATK extension that you would call to add MuTect to your script. Basically, what you need to do is similar to the realignment script; e.g. where you have

    val targetCreator = new RealignerTargetCreator with TCIR_Arguments
    

    you would replace that with something like

    val muTect = new MuTect [with SomeArguments ]
    

    where adding the arguments trait is optional.

    Your other problem is our fault: we're in the process of adding MuTect to GATK, and that's why there's a MuTect script in the Queue extensions, but the MuTect code it calls into is still in the private module of GATK, and you can't use it yet unless you're at the Broad. If not, you'll have to wait another two months or so to use the GATK-MuTect integration in Queue; or you'll have to call MuTect through an external call to the standalone MuTect jar file, which is awkward but can be made to work.

  • newGATKuser, Case (Member)

    Thanks Geraldine! I think I'll stick with calling MuTect with the standalone MuTect jar file, and wait until MuTect is integrated with GATK before trying to use it with Queue. Like you suggested in an earlier post, I may try to parallelize MuTect by breaking up the job by regions of interest.

  • subrataghosh, New Delhi (Member)

    I have installed "muTect-1.1.4.jar" and it is working in my Ubuntu OS. I have downloaded
    "--reference_sequence Homo_sapiens_assembly19.fasta
    --dbsnp dbsnp_132_b37.leftAligned.vcf
    --cosmic hg19_cosmic_v54_120711.vcf".
    But I need "Normal.cleaned.bam and Tumor.cleaned.bam" files as example files. From where can I download the sample files? I could not find them in the folders, ftp://ftp.broadinstitute.org/bundle/
    Would you kindly help me to have the proper input files. Thanks.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi @subrataghosh,

    We don't currently provide example bam files for MuTect, sorry. You can get test data from the DREAM challenge website.

  • Hello,
    I recently came across the PPT in the Google drive related to ContEst (https://drive.google.com/a/case.edu/folderview?id=0BwTg3aXzGxEDVk5RcEF3WW1SQWM&usp=sharing#)
    The presentation says the contamination estimate of the sample (output from ContEst) is used in MuTect. But I'm having trouble finding in the MuTect documentation how to input this information? Any help would be much appreciated.

    Thanks a lot,
    Steve

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @newGATKuser You use the argument --fraction_contamination to pass in the value from ContEst.

  • Thanks for the information, @Geraldine_VdAuwera !
    Is there a place I can find all the arguments for MuTect? For example, I couldn't find this argument in the help page when I typed: java -jar mutect-1.1.7.jar --help

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    If you add -T MuTect it should show up.

    For the next version the full MuTect docs will be included in the GATK docs.

  • Cool, thanks a lot

  • rernst, UMC Utrecht (Member)

    @Geraldine_VdAuwera said:
    Your other problem is our fault: we're in the process of adding MuTect to GATK, and that's why there's a MuTect script in the Queue extensions, but the MuTect code it calls into is still in the private module of GATK, and you can't use it yet unless you're at the Broad. If not, you'll have to wait another two months or so to use the GATK-MuTect integration in Queue; or you'll have to call MuTect through an external call to the standalone MuTect jar file, which is awkward but can be made to work.

    @Geraldine_VdAuwera, do you have a new ETA on the GATK-MuTect integration in Queue? With the latest GATK I still get the "Malformed walker argument: Could not find walker with name: MuTect" error.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @rernst I think we're looking at a few more weeks.

  • @Geraldine_VdAuwera As you mentioned, we will have to wait some time to use GATK-MuTect in Queue. So, do you know how to run the Scala script for MuTect posted on GitHub, "https://github.com/broadinstitute/mutect/blob/master/MuTectPipeline.scala"? I'm confused. Should I run the Scala script directly from the command line? I do want to use the scatter/gather feature, because running MuTect.jar for WGS somatic mutation analysis would take a very long time.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @mxqian You won't be able to use that scala script, sorry. You will need to write a script to call MuTect as if it was a normal command line program. Some guidance is provided in the Queue presentations in our last workshop (see Presentations section in the Guide).
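
Calling MuTect "as if it was a normal command line program" from a wrapper script might look roughly like the sketch below. It is written in Python rather than as a Queue script, purely to illustrate the idea. The flags are the ones quoted elsewhere in this thread (`--analysis_type`, `--reference_sequence`, `--input_file:tumor`, `--input_file:normal`, `--intervals`, `--cosmic`, `--dbsnp`, `--out`, `--coverage_file`, `--fraction_contamination`); the function names, defaults, and the choice of which arguments are optional are assumptions made for this example.

```python
import subprocess


def mutect_command(jar, reference, tumor_bam, out, normal_bam=None,
                   intervals=None, cosmic=None, dbsnp=None,
                   coverage_file=None, fraction_contamination=None):
    """Build the argument list for one standalone MuTect run."""
    cmd = ["java", "-Xmx2g", "-jar", jar,
           "--analysis_type", "MuTect",
           "--reference_sequence", reference,
           "--input_file:tumor", tumor_bam,
           "--out", out]
    # Arguments left as None are omitted, e.g. no matched normal available.
    optional = [
        ("--input_file:normal", normal_bam),
        ("--intervals", intervals),
        ("--cosmic", cosmic),
        ("--dbsnp", dbsnp),
        ("--coverage_file", coverage_file),
        ("--fraction_contamination", fraction_contamination),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_mutect(**kwargs):
    """Run MuTect and raise CalledProcessError if it exits with an error."""
    subprocess.run(mutect_command(**kwargs), check=True)
```

Building the argument list separately from running it makes it easy to print or log the exact command, and to launch one such command per interval file on a cluster.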

  • @Geraldine_VdAuwera said:
    We don't currently provide a PON resource; you would need to generate your own as described in the MuTect paper.

    Can you provide more information on the PON resource? What format is it in, and which input parameter takes it?
    Based on the documentation, it is not clear which input parameter is the PON resource?

    Thanks,

    java -Xmx2g -jar muTect-XXXX-XX-XX.jar
    --analysis_type MuTect
    --reference_sequence
    --cosmic <cosmic.vcf>
    --dbsnp <dbsnp.vcf>
    --intervals
    --input_file:normal <normal.bam>
    --input_file:tumor <tumor.bam>
    --out <call_stats.out>
    --coverage_file <coverage.wig.txt>

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Sorry, we need to add this to the docs. In the meantime, search the forum (search box, top right corner) for panel of normals and you will find your answers.

  • Where can I get the Mus_musculus_assembly9.fasta and dbsnp_128_mm9.vcf files required to process the mouse genome? I wasn't able to find them in the GATK bundle.

  • saa, saa (Member)

    Dear Geraldine,
    Could you provide some guidelines on how to execute MuTect when a matched normal is not available?
    Many thanks!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @pallavigudipati Sorry for the very late response; we do not yet provide resource files for somatic analysis in the GATK bundle.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @saa, We do not support doing this yet, but you can run MuTect on a tumor sample alone. However, you should provide a panel of normals, a dbSNP file and a COSMIC file if possible.

  • XiaYan_7557, Shenzhen, China (Member)

    Dear developer,
    I want to use MuTect2 with the parameter [-L targets.interval_list], but I don't know what format the list file should be in. Can you give me an example? Also, will MuTect2 output the coverage file? I can't find the relevant parameter.
    Thank you!

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    @XiaYan_7557 This discussion is about standalone versions of MuTect (up to 1.1.7). For MuTect2, please see the GATK tool documentation in the Guide section.

  • weini_huang, London (Member)

    Dear Geraldine_VdAuwera,
    Thank you for organising this section of questions & answers. It is very helpful.
    I want to use Mutect to call variants between before treatment tumour samples and drug-resistant tumour samples.
    I saw in your examples in the website the input files are normal samples and tumour samples as below:
    --input_file:normal <normal.bam>
    --input_file:tumor <tumor.bam>
    Can I use this software for my purpose but change the input files as
    --input_file:normal <before_treatment_tumour.bam>
    --input_file:tumor <drug_resistant_tumour.bam>
    ?

    Thank you very much for your help.
    Weini
