We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

(How to) Call somatic mutations using GATK4 Mutect2

bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
edited July 2019 in Tutorials

This tutorial is applicable to Mutect2 version and higher. Post suggestions/questions in the Ask the GATK team section.

GATK contains several major advances in the Mutect2 pipeline, which is good, but we have had to change command lines in a few places, which we usually try to avoid. Here are the major changes.

Bug Fixes

We fixed several bugs with error messages about invalid log probabilities, infinities, NaNs etc that were introduced by neglecting to account for finite precision errors. We also resolved an issue where CalculateContamination worked poorly on very small gene panels.

New Required Inputs to FilterMutectCalls

1) FilterMutectCalls now requires a reference, which should be the same fasta file input to Mutect2.
2) FilterMutectCalls also requires a stats file from Mutect2. When you run Mutect2 with output -O unfiltered.vcf, for example, it produces a file called unfiltered.vcf.stats. FilterMutectCalls automatically looks for this file, so if you are running locally no change is needed. That is,

gatk Mutect2 -R ref.fasta -I tumor.bam -O unfiltered.vcf

followed by

gatk FilterMutectCalls -R ref.fasta -V unfiltered.vcf -O filtered.vcf

works because Mutect2 creates unfiltered.vcf.stats behind the scenes and FilterMutectCalls knows to look for it. However, if you are running on a cluster or the cloud you need to keep track of the stats file. For example, you need to delocalize it from a VM, as is done in the Mutect2 WDL. You can explicitly input the stats file with the -stats argument in FilterMutectCalls. If you are scattering Mutect2 over multiple nodes you must merge the stats files with MergeMutectStats:

gatk MergeMutectStats -stats unfiltered1.vcf.stats -stats unfiltered2.vcf.stats -O merged.stats

and pass merged.stats to FilterMutectCalls.

New Filtering Strategy in FilterMutectCalls

FilterMutectCalls now filters based on a single quantity, the probability that a variant is a somatic mutation. Previously, different reasons why this might not be the case each had their own thresholds. We have removed parameters such as -normal-artifact-lod, -max-germline-posterior, -max-strand-artifact-probability, and -max-contamination-probability. Even the previously fundamental -tumor-lod is gone. Rather than replace these with a single threshold for the error probability, FilterMutectCalls replaces them with nothing at all. It automatically determines the threshold that optimizes the "F score", the harmonic mean of sensitivity and precision. In order to tweak results in favor of more sensitivity users may set -f-score-beta to a value greater than its default of 1 (beta is the relative weight of sensitivity versus precision in the harmonic mean). Setting it lower biases results toward greater precision.

You can think of these changes as doing two things. Unifying filtering thresholds puts all the filters at the same place on a ROC curve of sensitivity vs precision. Previously, one threshold might be sacrificing a lot of sensitivity for a small gain in precision while another might be doing the opposite, the result being poor sensitivity and precision that fell below the potential ROC curve. Once we’re on that ROC curve, we achieve a good balance between sensitivity and precision by optimizing the F score.

New Somatic Clustering Model in FilterMutectCalls

We had long suspected that modeling the spectrum of subclonal allele fractions would help distinguish somatic variants from errors. For example, if every somatic variant in a tumor were a het occurring in 40% of cells, we would know to reject anything with an allele fraction significantly different from 20%. In the Bayesian framework of Mutect2 this means that the likelihood for somatic variants is given by a binomial distribution. We account for an unknown number of subclones with a Dirichlet process binomial mixture model. This is still an oversimplification because CNVs, small subclones, and genetic drift of passenger mutations all contribute allele fractions that don’t match a few discrete values. Therefore, we include a couple of beta-binomials in the mixture to account for a background spread of allele fractions while still benefiting from clustering. Finally, we use these binomial and beta-binomial likelihoods to refine the tumor log odds calculated by Mutect2, which assume a uniform distribution of allele fractions.

In addition to clustering allele fractions we also learn the overall prior probabilities of somatic SNVs and indels so that we can be more skeptical of apparent variants in a quiet tumor, for example. We learn the parameters of this model with a stochastic EM approach, where the E steps consist of Chinese Restaurant Process sampling of the allele fraction clusters. In case you were wondering, we have tested this new approach on some of the off-label, non-cancer uses of Mutect2, such as mitochondria, and it works very well.
FilterMutectCalls reports the learned parameters of somatic clustering in a new .filtering_stats.tsv output. This file also contains information on the probability threshold chosen to optimize the F score and the number of false positives and false negatives from each filter to be expected from this choice.

A step-by-step guide to the new Mutect2 Read Orientation Artifacts Workflow

If you suspect any of your samples of substitution errors that occur on a single strand before sequencing you should definitely use Mutect2's orientation bias filter. This applies to all FFPE tumor samples and samples sequenced on Illumina Novaseq machines, among others. In fact, with the optimizations in you can run the filter even when you're not suspicious. It won't hurt accuracy and the CPU cost is now quite small.

There are three steps to the filter. First, run Mutect2 with the --f1r2-tar-gz argument. This creates an output with raw data used to learn the orientation bias model. Previously this was done by CollectF1R2Counts. By absorbing it into Mutect2, we eliminated the cost of CollectF1R2Counts with almost no effect on the runtime of Mutect2. When multiple tumor samples are specified, you only need a single --f1r2-tar-gz output, which contains data for each tumor sample.

gatk Mutect2 -R ref.fasta \
   -L intervals.interval_list \
   -I tumor.bam \
   -germline-resource af-only-gnomad.vcf \
   -pon panel_of_normals.vcf   \
   --f1r2-tar-gz f1r2.tar.gz \
   -O unfiltered.vcf

Next, pass this raw data to LearnReadOrientationModel:

gatk LearnReadOrientationModel -I f1r2.tar.gz -O read-orientation-model.tar.gz

Run GetPileupSummaries to summarize read support for a set number of known variant sites.

gatk GetPileupSummaries \
    -I tumor.bam \
    -V chr17_small_exac_common_3_grch38.vcf.gz \
    -L chr17_small_exac_common_3_grch38.vcf.gz \
    -O getpileupsummaries.table

Estimate contamination with CalculateContamination.

gatk CalculateContamination \
    -I getpileupsummaries.table \
    -tumor-segmentation segments.table \
    -O calculatecontamination.table

Finally, pass the learned read orientation model to FilterMutectCallswith the -ob-priors argument:

gatk FilterMutectCalls -V unfiltered.vcf \
   [--tumor-segmentation segments.table] \
   [--contamination-table contamination.table] \
   --ob-priors read-orientation-model.tar.gz \
   -O filtered.vcf

Advanced note: if you are scattering Mutect2 over nodes in a cluster or on the cloud, you must input the --f1r2-tar-gz output from each Mutect2 scatter to LearnReadOrientationModel. This is done automatically in the Mutect2 wdl in the gatk repo and on Terra. For example:

for chromosome in {1..22}; do
gatk Mutect2 -R ref.fasta -I tumor.bam -L $chromosome --f1r2-tar-gz ${chromosome}-f1r2.tar.gz -O ${chromosome}-unfiltered.vcf``
all_f1r2_input=`for chromosome in {1..22}; do printf -- "-I ${chromosome}-f1r2.tar.gz "; done`
LearnReadOrientationModel $all_f1_r2_input -O read-orientation-model.tar.gz

A step-by-step guide to the new Mutect2 Panel of Normals Workflow

We rewrote CreateSomaticPanelOfNormals to use GenomicsDB for scalability. We also added the INFO fields FRACTION (the fraction of normal samples with an artifact) and BETA (the shape parameters of the beta distribution of artifact allele fractions among samples with the artifact) to the panel of normals. FilterMutectCalls doesn’t use these yet, but we hope to experiment with them in the near future. Furthermore, we never liked how germline variants, which are handled in a more principled way with our germline filter, ended up as hard-filtered pon sites, so the panel of normals workflow now optionally excludes germline events from its output, keeping only technical artifacts.

The three steps to create a panel of normals are:

Step 1: Run Mutect2 in tumor-only mode for each normal sample:

gatk Mutect2 -R reference.fasta -I normal1.bam   --max-mnp-distance 0 -O normal1.vcf.gz
gatk Mutect2 -R reference.fasta -I normal2.bam   --max-mnp-distance 0 -O normal2.vcf.gz

Step 2: Create a GenomicsDB from the normal Mutect2 calls:

gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list \
  --genomicsdb-workspace-path pon_db \
  -V normal1.vcf.gz \
  -V normal2.vcf.gz \
  -V normal3.vcf.gz

Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:

gatk CreateSomaticPanelOfNormals -R reference.fasta \
  --germline-resource af-only-gnomad.vcf.gz \
  -V gendb://pon_db \
  -O pon.vcf.gz

In the third step, the AF-only gnomAD resource is the same public GATK resource used by Mutect2 (when calling tumor samples, but not in Step 1, above).

Post edited by bhanuGandham on


  • yyjyyj Member

    gatk Mutect2 -R ref.fasta \
    -L intervals.interval_list \
    -I tumor.bam \
    -germline-resource af-only-gnomad.vcf \
    -pon panel_of_normals.vcf \
    --f1r2-tar-gz f1r2.tar.gz \
    -O unfiltered.vcf

    Is this new command line for tumor-only mode?
    If this is,How about Tumor with matched normal?
  • polymerasepolymerase ShanghaiMember


    When I try step 2 to create PON, there is a mistake, “A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at chr1:889158 in VCF”.

    How to fix this?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin


    We have updated the tutorial with GenomicsDBImport command to exclude MNPs using --max-mnp-distance 0. Updated command as below:

    gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list --genomicsdb-workspace-path pon_db --max-mnp-distance 0 -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz

    This will fix the error.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
  • alanhoylealanhoyle UNC Lineberger Member

    How does one source/generate the af-only-gnomad.vcf for the PON creation and mutect2 runs?

  • ismael_Pismael_P Member

    The documentation doesn't say it but MergeMutectStats can take a .list file as input if you have a long list of .stats file.
    I think you should describe this optionin the help as it is helpfull.
  • anamariaanamaria Member

    Even though it is specified in the post header that "This tutorial is applicable to Mutect2 version and higher", it appears that MergeMutectStats does not exist in version Please correct me if I'm wrong.

  • anamariaanamaria Member

    I just realised this makes no sense and I got the versions mixed up, apologies.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
  • dcampodcampo Los AngelesMember


    I am trying to run GenomicsDB, but after 12 h, it exits with the following error:

    terminate called after throwing an instance of 'GenomicsDBConfigException'
    what(): GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_2075782628441988593.json

    This is my command line:

    java -jar $GATK GenomicsDBImport \
    -R $REFERENCE/GRCh38.p7.genome.fa \
    -L $REFERENCE/MedExome_hg38_capture_targets.bed \
    --genomicsdb-workspace-path $VCF/PON/PON_db \
    --sample-name-map $VCF/PON/normals_for_pon_map \
    --reader-threads 16 \
    --verbosity ERROR

    These are 5 exome samples, and I'm running on one node with 16 CPUs and 64 GB of RAM in our local cluster.
    Also, the tool creates 17,400 directories, all starting with chr1. Does that mean that it is still working on the variants in chr1?? Because if so, even if I make it run, it will take days...am I doing something wrong?


  • dcampodcampo Los AngelesMember


    an update on my comment about "Syntax error in JSON file /tmp/loader_2075782628441988593.json". I specified a different temp folder, and I am not getting that error anymore. I guess the default temp folder was filling up quickly.

    Anyways, I have now the problem with run time. After 24 h, the tool creates thousands of directories, but it's still working on chr 2, so it would take weeks to finish. And our university cluster has a wall time limit of 24 h...
    My command line is the same than I posted above, except with the added "--tmp-dir=$SCRATCHDIR" option.
    I don't think this is supposed to take that long, so I assume I am doing something wrong, but can't figure out what...any help, please?


  • JulsJuls Member ✭✭
    edited July 2019


    @ New Filtering Strategy in FilterMutectCalls

    Is there a description available for the (remaining) filters?

    These are the new / edited ones, correct?


  • jcx9dyjcx9dy Member

    @dcampo I have the same issue... would love to know if you figure it out.

    After bouncing through the forums for a while, these are the links I've scrounged up:

    Some unorganized thoughts:

    • filesystem i/o read/write could really be slowing things down (these issues become more apparent on a cluster, as opposed to cloud instances that abstract away the shared resource limitations)
    • one recommendation was to scatter/parallelize the db import over many intervals (but still all samples, though perhaps with the --batch 50 option); subsequently merge the interval-specific pooled vcf somehow (how to properly do this? .. in the firecloud example?)
    • another last resort is to switch back to the old system of using CombineGVCFs which is similar to what Queue used to use?

    I vaguely remember using Queue (scala workflow instead of wdl) to first scatter intervals and samples for mutect2 tumor-only mode for normals.bam, and the final step was calling CombineVariants to yield the PON.vcf This last merge step was very fast (~10 hours total for ~40 WGS). Something must've drastically changed about the more recent merging workflows if people report needing hundreds of hours.

  • RishabhRishabh IndiaMember
    I am trying to Create a GenomicsDB from the normal Mutect2 calls
    So there are four command that i used and it pop out with different errors but now m i struck .
    1. Creating GBImport
    java -jar /PathToGatk GenomicsDBImport -R Hg19_Reference.fa -L V6.bed -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz -V normal4.vcf.gz -V normal5.vcf.gz --genomicsdb-workspace-path pon_db
    A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs.

    2. Removing MNP from vcf.gz files
    java -jar /PathToGatk SelectVariants -R Hg19_Reference.fa --variant normal1.vcf.gz -O normal1_without_mnp.vcf.gz --xl-select-type MNP

    3. Creating GBImport without MNP in vcf.gz files
    java -jar /PathToGatk GenomicsDBImport -R Hg19_Reference.fa -L V6.bed -V normal1_without_mnp.vcf.gz -V normal2_without_mnp.vcf.gz -V normal3_without_mnp.vcf.gz -V normal4_without_mnp.vcf.gz -V normal5_without_mnp.vcf.gz --genomicsdb-workspace-path pon_db
    Duplicate field name AF found in vid attribute "fields"
    terminate called after throwing an instance of 'FileBasedVidMapperException'
    what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute "fields"

    4. Using --max-mnp-distance 0 option to create GBImport
    java -jar /PathToGatk GenomicsDBImport -R Hg19_Reference.fa -L V6.bed -V normal1_without_mnp.vcf.gz -V normal2_without_mnp.vcf.gz -V normal3_without_mnp.vcf.gz -V normal4_without_mnp.vcf.gz -V normal5_without_mnp.vcf.gz --genomicsdb-workspace-path pon_db --max-mnp-distance 0
    A USER ERROR has occurred: max-mnp-distance is not a recognized option
  • jcx9dyjcx9dy Member


    4.) this page has not been corrected yet (one of the admins would need to move the --max-mnp-distance 0 to step1 instead of step2 in the main post); but from a comment in this section, the --max-mnp-distance 0 is an option for Mutect2, not for GenomicsDBImport. So if you want to use this, you should have used it when creating the normal*.vcf.gz

    I'm not sure what (3.) is ... and am also not sure that (2.) will be sufficient. You could keep researching (3.) or, the easier/safer option (which is part of the established recommendations) is to just run Mutect2 with the aforementioned parameter to remove MNP.

  • alanhoylealanhoyle UNC Lineberger Member

    Re: PON generation:

    For any genome-wide interval sets, I think you're also going to want to run GenomicsDBImport with the --merge-input-intervals option.

    Were running this on TCGA normals that we aligned to the GDC's GRCh38.d1.vd1 reference with BWA, then sorted/markdup before running through the Mutect2 artifact generation, then using Gencode exons for our --intervals.

    We're still doing testing, but a colleague found the previous version (i.e. without the GenomicsDB) to run far faster when making the PON.

  • dcampodcampo Los AngelesMember
    edited July 2019

    I did not have time to troubleshoot, so I ended up installing a previous version of GATK to avoid GenomicsDBImport. At some point I will have to come back to this and play around a bit with parallelizing, etc.
    That said, I don't believe it is a matter of run time. Like I said above. these are only 5 exome samples; if after 24 h running, the tool is still working on chr2, it means that the whole thing will take weeks to complete. It has to be something else...
    So, any help on this from the team will be greatly appreciated!

  • alanhoylealanhoyle UNC Lineberger Member

    Just as a follow-up: combining 50 TCGA normals (600k-4m variants per artifact VCF; 75.4 million data rows total) resulted in a 108 MB VCF with 7.1 million data rows (i.e. not headers). GenomicsDBImport of the 50 VCFs took 164 minutes (2.7 hr) and created a 2.8 GB pon_db, CreateSomaticPanelOfNormals took 155 minutes (2.6 hr). That's over 6 hours on some fast machines.

    If one neglects the --merge-input-intervals option, I think it runs vastly slower and outputs many more error messages in part 2.

    Are these the kinds of numbers we should expect?

  • alanhoylealanhoyle UNC Lineberger Member

    @dcampo try gatk GenomicsDBImport --merge-input-intervals ... that brought the time down to something that ran in 5-7 hours for us, but that is still far longer than I would have expected.

  • tonytony Member ✭✭
    edited July 2019

    Hi @bhanuGandham

    Regarding pon creation only,

    1. we must not provide --germline-resource in Mutect2 call of Step1 ?
    2. --max-mnp-distance 0 has to be set in Mutect2 call of Step1 ?

    Many thanks,

  • alanhoylealanhoyle UNC Lineberger Member

    @alanhoyle said:
    @dcampo try gatk GenomicsDBImport --merge-input-intervals ... that brought the time down to something that ran in 5-7 hours for us, but that is still far longer than I would have expected.

    Running the same VCFS through the older CreateSomaticPanelOfNormals runs in ~5 minutes, but the PON.vcf doesn't contain the same information. One hopes that this is worth it?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @shenglai

    it seems max-mnp-distance should be added to the Mutect2 tumor only calling. I just wanna confirm if I understand it correctly with you

    You are right and thank you for bringing it to our notice. We have corrected this error.

  • dcampodcampo Los AngelesMember

    So I tried re-running with the --merge-input-intervals option, and it ran in literally 1 min...
    From weeks to 1 min!!!
    That option should really be emphasized in the tool description, or maybe even included by default.
    Thanks a lot!

  • syer89syer89 Member
    edited July 2019
    Hi @davidben and others, wondering if there a limitation on indel calling with Mutect2. Although it picked some validated INDELS (15bp del, 33bp ins, etc..), I have an 88bp deletion it did not. It was previously picked by HC (v3.2-) at ~14% and with v4.1.2 as well at ~9.75%. I understand there will be differences between the two callers, but it is real and not present in PON to be filtered out. Just want to make sure if this is an expected outcome (like the indel length etc..)
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @syer89 indels of that size are towards the limit of what Mutect2 can handle. There's no fixed limit, just that our default assembly parameters start missing things around that size.

  • ShishiMinShishiMin Member


    gatk Mutect2 -R ref.fasta \
    -L intervals.interval_list \
    -I tumor.bam \
    -germline-resource af-only-gnomad.vcf \
    -pon panel_of_normals.vcf \
    --f1r2-tar-gz f1r2.tar.gz \
    -O unfiltered.vcf

    This new command line is for tumor-only mode, But i want know the command for tumor and matched normal mode. Is the following command right? Let me know if I'm wrong. Thank you! What'more I am confused about that FilterByOrientationBias is only needs to be run on the tumor sample(s). Do I still need --f1r2-tar-gz argument in tumor and matched normal mode?

    gatk Mutect2 \
    -R reference.fa \
    -I tumor.bam \
    -I normal.bam \
    -normal normal_sample_name \
    --germline-resource af-only-gnomad.vcf.gz \
    -pon panel_of_normals.vcf \
    --f1r2-tar-gz f1r2.tar.gz \
    -O unfiltered.vcf

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @ShishiMin your tumor-normal mode command line is correct. This generates raw F1R2 count data for just the tumor sample, which you then process in LearnReadOrientationModel, the output of which goes to FilterMutectCalls.

  • syer89syer89 Member
    Hi @davidben , while using GATK4 (v1.4.2.0, mutect2/tumour mode) and testing for reproducibility it seems to vary in terms of #no of variants and annotations. Is that observed and any fixed seed option to use?
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @syer89 There's no fixed seed yet but that's a really good idea. I'll try to do that soon.

  • syer89syer89 Member
    Thank you @davidben. Any estimate of when the fix will be available? Also, is there documentation to better understand the filteringStats.tsv generated by FilterMutect calls?
  • baltimoreoriolebaltimoreoriole baltimoreMember

    Hi, I spent some good amount of time to understand GATK tools including Mutect2. I feel this is a maze of blogs, deprecated pages and variety of options. Could you please help me focus on the tools and steps required. Following is my situation:
    1. I have 100 tumor samples with no normal germline samples.
    2. Sequencing was done for 4 genes using libraries from Fluidigm followed by Illumina sequencer in a paired end mode.
    3. I aligned reads -> SAM -> BAM -> picard fixmate -> picard reorder sam -> GATK base recalibration -> MuTect2 (GATK3)
    java -jar $GATK -T MuTect2 \
    -R ucsc.hg19.fasta \
    -I:tumor Tumor1.bam \
    --dbsnp hg19-All_chr.vcf \
    --cosmic b37_cosmic_v54_120711_chr.vcf \
    -L MyTargetGenes.bed \
    -o Tumor1.vcf

    Is there a step-wise guide for Tumors samples only GATK4 Mutect2. Also it is confusing how to generate some of the input files that are described here. For example -
    -germline-resource af-only-gnomad.vcf \
    -pon panel_of_normals.vcf \
    --f1r2-tar-gz f1r2.tar.gz \
    Where can I find af-only-gnmoad.vcf? Where is 'panel of normals' pon panel_of_normals.vcf.
    I checked at the following FTP site. From the wide array of choices I am lost and I cannot find which one to use as panel_of_normals.vcf. (ftp://[email protected]/bundle/hg19/ )

    Could you please help me find a workflow that is best for identifying mutations in each patient tumor sample (100 samples) and with no matching germline normals.

    Thank you so much and appreciate your time helping me.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @syer89 I have what I think is a fix but I'm having trouble reproducing the error. Could you post a few examples of sites that change? Do you have a sense of which annotations change? Also, could you post the filtering stats from two runs of the same pipeline that gave different outputs?

    There is currently no documentation for the filtering stats, but, briefly, it has these three parts:

    1) estimated somatic mutation rates (natural log scale) for SNVs and indels of different lengths
    2) summary of the beta mixture model of allele fraction clustering
    3) estimated false positive and false negative rate attributable to each filter

    We may document it more in the future but it's a new feature that for now we intend mainly for our own debugging.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @baltimoreoriole The information here: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php is up-to-date and includes tumor-only calling.

    Additionally, our developer docs: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf are always up-to-date and contain command lines. A complete pipeline would include the Mutect2 command on page 1, (optionally) the LearnReadOrientationModel command on page 8, the GetPileupSummaries and CalculateContamination commands on page 10, and the FilterMutectCalls command on page 3.

    You do not need to make you own panel of normals (unless you have a huge number of samples, it may even be counterproductive than our generic public panel). Instead you may use gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf.

    For the -germline-resource you should use gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf.

    For the -V and -L arguments to GetPileupSummaries you may use gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf.

    Finally, --f1r2-tar-gz is an output, not an input, of Mutect2.

  • baltimoreoriolebaltimoreoriole baltimoreMember

    Hi David,
    Thanks for your help in pointing me to correct site and resource.

    While running gatk ver4 Mutect2 I get this error:

    gatk --java-options "-Xmx24G" Mutect2 \
    -R ucsc.hg19.fasta \
    -L Mygenes.bed \
    -I tumor.bam \
    -germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz \
    -pon Mutect2-exome-panel.vcf \
    --f1r2-tar-gz f1r2.tar.gz \
    -O unfiltered.vcf

    also tried:
    gatk Mutect2 \
    -R ucsc.hg19.fasta \
    -L Mygenes.bed \
    -I tumor.bam \
    -germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz \
    -pon Mutect2-exome-panel.vcf \
    --f1r2-tar-gz f1r2.tar.gz \
    -O unfiltered.vcf

    I get an error that f1r2-tar-gz is not a recognized option

    What could be wrong.


    USAGE: Mutect2 [arguments]

    Call somatic SNVs and indels via local assembly of haplotypes

    Required Arguments:
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (xxx -means printing all the options which I am not pasting here)..

    A USER ERROR has occurred: f1r2-tar-gz is not a recognized option

    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @baltimoreoriole You will need to use a more recent version than 4.1. The current version,, is significantly more advanced.

  • 29043594952904359495 Member ✭✭

    when will publish, because contamination not good for small panel, thanks a lot

  • syer89syer89 Member
    edited September 2019

    @davidben, my observation is that 70% of the variant differences are related to "panel of normals". Below is one sample rerun example (bwa-mem + markdup is the starting bam) using GATK_v4.1.2.0. I also attached the filtering stats files here.

    1st re-run:
    (FN variants - basically not found in the 2nd re-run)
    4 1961206 . GCC TGA . base_qual;clustered_events;panel_of_normals;strand_bias CONTQ=18;DP=445;ECNT=5;GERMQ=93;MBQ=32,13;MFRL=161,97;MMQ=60,60;MPOS=22;PON;POPAF=7.3
    ;ROQ=93;SEQQ=16;STRANDQ=1;TLOD=6.66 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:438,7:0.017:445:144,1:230,0:368,70,0,7

    19 11136038 . A C . base_qual;clustered_events;panel_of_normals;weak_evidence CONTQ=89;DP=162;ECNT=6;GERMQ=93;MBQ=28,12;MFRL=186,187;MMQ=60,60;MPOS
    =13;PON;POPAF=7.3;ROQ=15;SEQQ=1;STRANDQ=4;TLOD=3.3 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:133,10:0.044:143:51,3:27,1:128,5,10,0

    2nd re-run:
    (FP variant - occurred in this re-run)
    14 106324729 . G C . base_qual;panel_of_normals CONTQ=93;DP=83;ECNT=2;GERMQ=53;MBQ=20,15;MFRL=129,229;MMQ=60,60;MPOS=12;PON;POPAF=7.30;ROQ=26;SEQQ=17;STRANDQ=12;TLOD=6.92 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:64,14:0.149:78:13,5:21,1:36,28,2,12

    In terms of annotations, filters like CONTQ, GERMQ, T_LOD values (and more?) vary between the reruns which I guess in turn will affect the filter flags being annotated. For example, the same variant in different re-runs -

    1st re-run:
    14 106146495 . T TG . haplotype CONTQ=93;DP=68;ECNT=2;GERMQ=5;MBQ=32,32;MFRL=209,189;MMQ=60,60;MPOS=20;POPAF=7.3;ROQ=93;RPA=1,2;RU=G;SEQQ=93;STR;STRANDQ=9;STRQ=93;TLOD=72.8 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:37,21:0.366:58:15,9:17,10:0|1:106146495_T_TG:106146495:2,35,0,21

    19 1612476 . T G . PASS CONTQ=93;DP=174;ECNT=2;GERMQ=93;MBQ=31,22;MFRL=174,195;MMQ=60,60;MPOS=10;POPAF=7.3;ROQ=37;SEQQ=62;STRANDQ=3;TLOD=11.08 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:151,13:0.065:164:52,3:62,4:53,98,1,12

    2nd re-run:
    14 106146495 . T TG . germline;haplotype CONTQ=93;DP=68;ECNT=2;GERMQ=3;MBQ=32,32;MFRL=209,189;MMQ=60,60;MPOS=20;POPAF=7.3;ROQ=93;RPA=1,2;RU=G;SEQQ=93;STR;STRANDQ=9;STRQ=93;TLOD=72.8 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:37,21:0.366:58:15,9:17,10:0|1:106146495_T_TG:106146495:2,35,0,21

    19 1612476 . T G . strand_bias CONTQ=93;DP=174;ECNT=2;GERMQ=93;MBQ=31,22;MFRL=174,195;MMQ=60,60;MPOS=10;POPAF=7.3;ROQ=37;SEQQ=62;STRANDQ=3;TLOD=11.08 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:151,13:0.065:164:52,3:62,4:53,98,1,12

    I hope this is helpful and pls let me know if you need any additional information from me.

  • ShishiMinShishiMin Member

    where can i download this file gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf.?

  • ShishiMinShishiMin Member

    @davidben I found it. Thank you.

  • jamespjamesp Member

    Where does Mutect2 put the .stats file? I can't find it to delocalize for next steps like filtering. I expected it to be at (in my example) unfiltered.vcf.stats, but that file isn't produced. (Running this with the Dockerized GATK.)

    Command line:

    /gatk/gatk --java-options "-Xmx${RAM}G" \
        Mutect2 \
        --input ${cram} \
        --reference ${REFERENCE_FASTA} \
        --output unfiltered.vcf

    GATK library info:

    10:06:22.798 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-!/com/intel/gkl/native/libgkl_compression.so
    10:06:24.107 INFO  Mutect2 - ------------------------------------------------------------
    10:06:24.108 INFO  Mutect2 - The Genome Analysis Toolkit (GATK) v4.1.4.0
    10:06:24.108 INFO  Mutect2 - For support and documentation go to https://software.broadinstitute.org/gatk/
    10:06:24.109 INFO  Mutect2 - Executing as [email protected] on Linux v4.19.80+ amd64
    10:06:24.114 INFO  Mutect2 - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03
    10:06:24.114 INFO  Mutect2 - Start Date/Time: November 9, 2019 10:06:22 AM UTC
    10:06:24.114 INFO  Mutect2 - ------------------------------------------------------------
    10:06:24.115 INFO  Mutect2 - ------------------------------------------------------------
    10:06:24.115 INFO  Mutect2 - HTSJDK Version: 2.20.3
    10:06:24.116 INFO  Mutect2 - Picard Version: 2.21.1
    10:06:24.116 INFO  Mutect2 - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    10:06:24.116 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    10:06:24.116 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    10:06:24.116 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    12:16:20.925 INFO  ProgressMeter - Un_KN707838v1_decoy:901            129.9              10752010          82766.9
    12:16:30.929 INFO  ProgressMeter - Un_KN707896v1_decoy:301            130.1              10752790          82666.8
    12:16:40.934 INFO  ProgressMeter - Un_KN707906v1_decoy:996            130.2              10753100          82563.4
    12:16:51.217 INFO  ProgressMeter - Un_KN707951v1_decoy:1060            130.4              10753580          82458.5
    12:17:01.258 INFO  ProgressMeter - Un_JTFH01000012v1_decoy:1801            130.6              10754880          82362.8
    Using GATK jar /gatk/gatk-package-

    But then after the run completes, only the unfiltered.vcf file is present, no unfiltered.vcf.stats file is seen:

    #List files in current path:
  • kchang3kchang3 Member
    1. Regarding the PON workflow above, is it still outputting germline variants as default?
    2. Does filterMutect step uses PON as a hard filter in version, such that if a variant is seen PON then it's excluded ?
  • deena_bdeena_b Member
    edited November 2019

    Would someone please post a link to the doc page for MergeMutectStats?
    Thank you :smile:

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @jamesp Does the output show M2 failing somewhere? That could cause it to produce a VCF but not a stats file.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @kchang3 If you do not give CreateSomaticPanelOfNormal a --germline-resource input then the PoN will contain germline variants.

    FilterMutectCalls uses the PoN as a hard filter.

  • kchang3kchang3 Member

    @davidben I'm running gatk- Mutect2 on mice wes tumor and matched normal data with pon as described above. I noticed a few tumor-normal pairs have > 90% mutations filtered due to normal artifact. Could the --normal-p-value-threshold be too stringent for these pairs? How should I go about understanding why these samples have such high normal artifact filtered mutations?

Sign In or Register to comment.