Somatic short variant discovery (SNVs + Indels)

Purpose
Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.
Reference Implementations
Pipeline | Summary | Notes | Github | Terra |
---|---|---|---|---|
Somatic short variants tumor-normal pair | T-N BAMs to VCF | universal | yes | b37 |
Somatic short variants PON creation | Normal BAMs to PON | universal | yes | b37 |
Expected input
This workflow requires BAM files for each input tumor and normal sample. Input BAMs should be pre-processed as described in the GATK Best Practices for data pre-processing.
Main steps
There are two main steps to this workflow. First we generate a large set of candidate somatic variants, then we filter them to obtain a more confident set of somatic variant calls.
Call candidate variants
Tools involved: Mutect2
Like HaplotypeCaller, Mutect2 calls SNVs and indels simultaneously via local de-novo assembly of haplotypes in an active region. That is, when Mutect2 encounters a region showing signs of somatic variation, it discards the existing mapping information and completely reassembles the reads in that region in order to generate candidate variant haplotypes. Like HaplotypeCaller, Mutect2 then aligns each read to each haplotype via the Pair-HMM algorithm to obtain a matrix of likelihoods. Finally, it applies a Bayesian somatic likelihoods model to obtain the log odds for alleles to be somatic variants versus sequencing errors.
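As a rough illustration of how this step is typically invoked, the sketch below assembles a Mutect2 command line for a tumor sample with an optional matched normal. It only builds and prints the command; it does not run GATK. The helper name `mutect2_command` and all file and sample names are hypothetical placeholders, and the exact arguments you need may differ between GATK versions.

```python
import shlex


def mutect2_command(reference, tumor_bam, output_vcf, normal_bam=None,
                    normal_sample=None, panel_of_normals=None, f1r2_tar_gz=None):
    """Assemble (but do not run) a Mutect2 command line as a list of arguments."""
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam]
    if normal_bam is not None:
        # The normal is identified by its sample name, not by its file name.
        if normal_sample is None:
            raise ValueError("normal_sample is required when normal_bam is given")
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    if panel_of_normals is not None:
        cmd += ["--panel-of-normals", panel_of_normals]
    if f1r2_tar_gz is not None:
        # Optional F1R2 counts output, used later to learn orientation bias.
        cmd += ["--f1r2-tar-gz", f1r2_tar_gz]
    cmd += ["-O", output_vcf]
    return cmd


# Hypothetical file and sample names, for illustration only.
command = mutect2_command(
    reference="ref.fasta",
    tumor_bam="tumor.bam",
    normal_bam="normal.bam",
    normal_sample="normal_sample_name",
    panel_of_normals="pon.vcf.gz",
    f1r2_tar_gz="f1r2.tar.gz",
    output_vcf="unfiltered.vcf",
)
print(shlex.join(command))
```

Omitting `normal_bam` gives the tumor-only form of the same command.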
Calculate Contamination
Tools involved: GetPileupSummaries, CalculateContamination
This step emits, for each tumor sample, an estimate of the fraction of reads due to cross-sample contamination and an estimate of the allelic copy number segmentation. Unlike other contamination tools, CalculateContamination is designed to work well without a matched normal, even in samples with significant copy number variation, and it makes no assumptions about the number of contaminating samples.
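A minimal sketch of how the two tools in this step chain together is shown below. It only builds the command lines; it does not run them. The helper names and file names are hypothetical, and the sites-only VCF of common germline variants passed to GetPileupSummaries is an assumption that this page does not describe.

```python
import shlex


def pileup_summaries_command(bam, common_variants_vcf, output_table):
    """Build a GetPileupSummaries command that tabulates read counts at common variant sites."""
    return [
        "gatk", "GetPileupSummaries",
        "-I", bam,
        "-V", common_variants_vcf,  # assumed resource: common germline variant sites
        "-L", common_variants_vcf,  # restrict the traversal to those same sites
        "-O", output_table,
    ]


def calculate_contamination_command(tumor_pileups, contamination_table,
                                    segments_table=None, normal_pileups=None):
    """Build a CalculateContamination command; the matched normal is optional."""
    cmd = ["gatk", "CalculateContamination", "-I", tumor_pileups]
    if normal_pileups is not None:
        cmd += ["-matched", normal_pileups]
    if segments_table is not None:
        # Allelic copy number segmentation of the tumor.
        cmd += ["--tumor-segmentation", segments_table]
    cmd += ["-O", contamination_table]
    return cmd


# Hypothetical file names, for illustration only.
pileup_cmd = pileup_summaries_command(
    "tumor.bam", "common_variants.vcf.gz", "tumor-pileups.table")
contamination_cmd = calculate_contamination_command(
    "tumor-pileups.table", "contamination.table", segments_table="segments.table")

for cmd in (pileup_cmd, contamination_cmd):
    print(shlex.join(cmd))
```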
Learn Orientation Bias Artifacts
Tools involved: LearnReadOrientationModel
This tool uses an optional F1R2 counts output of Mutect2 to learn the parameters of a model for orientation bias. For each trinucleotide context, it estimates the prior probability of single-stranded substitution errors that arise before sequencing. This is especially important for FFPE tumor samples.
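The sketch below shows one way the learned model might be wired into the filtering step: LearnReadOrientationModel consumes the F1R2 counts archive written by Mutect2, and its output is then handed to FilterMutectCalls. The helper names and file names are hypothetical, the commands are built but not executed, and argument names should be checked against the GATK version in use.

```python
import shlex


def learn_orientation_model_command(f1r2_tar_gz, model_tar_gz):
    """Build a LearnReadOrientationModel command from Mutect2's F1R2 counts archive."""
    return ["gatk", "LearnReadOrientationModel", "-I", f1r2_tar_gz, "-O", model_tar_gz]


def filter_with_orientation_model_command(reference, unfiltered_vcf, model_tar_gz,
                                          filtered_vcf):
    """Build a FilterMutectCalls command that uses the learned orientation bias priors."""
    return [
        "gatk", "FilterMutectCalls",
        "-R", reference,
        "-V", unfiltered_vcf,
        "--ob-priors", model_tar_gz,
        "-O", filtered_vcf,
    ]


# Hypothetical file names, for illustration only.
learn_cmd = learn_orientation_model_command("f1r2.tar.gz", "read-orientation-model.tar.gz")
filter_cmd = filter_with_orientation_model_command(
    "ref.fasta", "unfiltered.vcf", "read-orientation-model.tar.gz", "filtered.vcf")

for cmd in (learn_cmd, filter_cmd):
    print(shlex.join(cmd))
```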
Filter Variants
Tools involved: FilterMutectCalls
Mutect2’s somatic likelihoods model assumes that read errors are independent, so that, for example, four reads each with an error probability of 1/1000 yield odds of roughly 1000^4 (a log10 odds of about 12) in favor of being a real variant versus a sequencing error. FilterMutectCalls accounts for correlated errors, that is, the possibility that all variant reads at a site were due to some common source of error. It accomplishes this through several hard filters to detect alignment artifacts and probabilistic models for strand and orientation bias artifacts, polymerase slippage artifacts, germline variants, and contamination. Additionally, it learns a Bayesian model for the overall SNV and indel mutation rate and allele fraction spectrum of the tumor to refine the log odds emitted by Mutect2. It then automatically sets a filtering threshold to optimize the F score, the harmonic mean of sensitivity and precision.
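Two pieces of arithmetic from this paragraph can be made concrete with a toy sketch. The first function computes the odds implied by treating read errors as independent. The second picks the error-probability cutoff that maximizes an expected F score over a handful of made-up calls. This is an illustration of the idea under simplifying assumptions, not the actual FilterMutectCalls implementation; the function names and the example probabilities are hypothetical.

```python
import math


def independent_error_log10_odds(error_prob, n_reads):
    """log10 odds of a real variant vs. n_reads independent errors of equal probability."""
    return -n_reads * math.log10(error_prob)


def best_error_threshold(error_probs):
    """Return (threshold, expected F score) for the best cutoff on per-call error probability.

    Calls with error probability <= threshold pass. Expected counts treat each
    call as a true variant with probability 1 - p.
    """
    probs = sorted(error_probs)
    total_true = sum(1.0 - p for p in probs)
    best_threshold, best_f = None, 0.0
    exp_tp = exp_fp = 0.0
    for p in probs:
        exp_tp += 1.0 - p            # expected true positives among passing calls
        exp_fp += p                  # expected false positives among passing calls
        exp_fn = total_true - exp_tp  # expected true variants that were filtered
        # Harmonic mean of precision and sensitivity, written in terms of counts.
        f = 2 * exp_tp / (2 * exp_tp + exp_fp + exp_fn)
        if f > best_f:
            best_threshold, best_f = p, f
    return best_threshold, best_f


# Four reads, each with error probability 1/1000: odds of about 1000^4 = 10^12.
log_odds = independent_error_log10_odds(1e-3, 4)
print(round(log_odds, 6))

# Made-up per-call error probabilities, for illustration only.
threshold, f_score = best_error_threshold([0.01, 0.02, 0.1, 0.6, 0.95])
print(threshold, round(f_score, 3))
```

In this toy set, passing the three most confident calls gives the highest expected F score; admitting the call with error probability 0.6 costs more in expected precision than it gains in expected sensitivity.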
Annotate Variants
Tools involved: Funcotator
At this step we run tools to add information to the discovered variants in our dataset. One of those tools, Funcotator, can be used to add gene-level information to each variant. Funcotator is a functional annotation tool in the core GATK toolset and was designed to handle both somatic and germline use cases. Funcotator reads in a VCF file, labels each variant with one of twenty-three distinct variant classifications, produces gene information (e.g. affected gene, predicted variant amino acid sequence, etc.), and associations to information in datasources. Supported datasources include GENCODE (gene information and protein change prediction), dbSNP, gnomAD, and COSMIC (among others). The corpus of datasources is extensible and user-configurable and includes cloud-based datasources supported with Google Cloud Storage. Funcotator produces either a Variant Call Format (VCF) file (with annotations in the INFO field) or a Mutation Annotation Format (MAF) file.
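For orientation, a Funcotator invocation might be assembled along the following lines. The sketch builds the command without running it. The helper name, file names, and data sources directory are hypothetical, and the reference version string depends on the reference your data was aligned to.

```python
import shlex


def funcotator_command(variants_vcf, reference, ref_version, data_sources_dir,
                       output_path, output_format="MAF"):
    """Build a Funcotator command that writes either VCF or MAF annotations."""
    if output_format not in ("VCF", "MAF"):
        raise ValueError("output_format must be 'VCF' or 'MAF'")
    return [
        "gatk", "Funcotator",
        "--variant", variants_vcf,
        "--reference", reference,
        "--ref-version", ref_version,
        "--data-sources-path", data_sources_dir,
        "--output", output_path,
        "--output-file-format", output_format,
    ]


# Hypothetical paths and reference version, for illustration only.
annotate_cmd = funcotator_command(
    variants_vcf="filtered.vcf",
    reference="ref.fasta",
    ref_version="hg19",
    data_sources_dir="funcotator_dataSources",
    output_path="filtered.annotated.maf",
)
print(shlex.join(annotate_cmd))
```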
Additional Information
- What's new with Mutect2 since v4.1.1.0?
- (How to) Call somatic mutations using GATK4 Mutect2
- Somatic calling is NOT simply a difference between two callsets
- Funcotator Information and Tutorial
- ActiveRegion determination (HaplotypeCaller & Mutect2)
- Evaluating the evidence for haplotypes and variant alleles (HaplotypeCaller & Mutect2)
- Local re-assembly and haplotype determination (HaplotypeCaller & Mutect2)
Comments
Just wanted to check in to see if comment is still relevant or if the new documentation has already been uploaded? Thanks so much!
Also checking in to see if there are any updates.
How do I reference this picture if I use it?
@alongalor @ehscholl:
For GATK4 Mutect2 related links, see https://software.broadinstitute.org/gatk/blog?id=11337.
The exploratory tutorial is at https://software.broadinstitute.org/gatk/documentation/article?id=11136.
@Rebecca_Donnelly You can credit the figure to the Broad Institute Data Sciences Platform and link to this page.
@alongalor and @ehscholl: these doc pages are trailing a bit behind the state of the workflows themselves, sorry. We plan to have more comprehensive overview-level docs here than are currently available (see the germline short variants for a preview of what we're aiming for) but for now your best bet is to check out the more detailed docs that @shlee referenced above.
I read in the forums somewhere that the workflows are coming out in April, any updates?
@hashish
Hi,
Soo Hee published a blog with links to all Mutect2 related articles.
-Sheila
Thank you @Sheila. I was specifically asking whether there is a detailed Mutect2 best practices document similar to the germline one (as mentioned by @Geraldine_VdAuwera).
@hashish Not yet, we’re working on it.
Why is the task CollectSequencingArtifactMetrics deprecated?
I noticed that in mutect2.wdl the task CollectSequencingArtifactMetrics and the option run_orientation_bias_filter are deprecated.
Is that because the step is no longer necessary, or because you have a better tool to replace gatk CollectSequencingArtifactMetrics?
@woodwordf_aa
Hi,
There is a better tool to replace that step which should be out very soon.
-Sheila
I remember reading that, after creating a Panel of Normals, it was possible to add more samples to the panel without including all the previously used normal samples. Is this feature still available?
I have a question regarding the WDL workflow. How can I limit the number of cores and the amount of memory used? I'm running it locally on a server with 40 cores and 500 GB of RAM. The process of creating the Panel of Normals (with 2 samples) quickly goes up to 400 GB and counting.
Dear gatk team,
I have a question about mutect2.wdl running time.
I ran mutect2.wdl with mutect2.exome.inputs.json (provided on your GitHub page) on my own VM (24 cores, 120 GB RAM) after changing the input files to local paths.
It finished successfully in 11 minutes and generated HCC1143-filtered.vcf (579k). Is this a reasonable result, or is it suspiciously fast? In Terra, I saw it take around 2 hours.
Is there any benchmark data for the somatic SNVs + indels workflow?
Is there a detailed description of the differences between the NIO and non-NIO versions?
I tried mutect2_nio.wdl with mutect2.exome.inputs.json (provided on the GitHub page) on my own VM (24 cores, 120 GB RAM) after changing the input files to local paths,
but it failed. I checked its stderr:
[August 1, 2019 1:10:13 PM UTC] org.broadinstitute.hellbender.tools.GetSampleName done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=1961361408
A USER ERROR has occurred: The specified fasta file (file:///home/cloud-user/gatk4-somatic-snvs-indels/inputs/Homo_sapiens_assembly19.fasta) does not exist.
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /root/gatk.jar defined in environment variable GATK_LOCAL_JAR
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx3000m -jar /root/gatk.jar GetSampleName -R /home/cloud-user/gatk4-somatic-snvs-indels/inputs/Homo_sapiens_assembly19.fasta -I /home/cloud-user/gatk4-somatic-snvs-indels/inputs/HCC1143.bam -O tumor_name.txt -encode
Actually, the fasta file did exist. Do I need to change the "Dsamjdk.use_async_io_read_samtools=false" to "true" ?