Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

(How to) Call somatic mutations using GATK4 Mutect2

bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
edited July 18 in Tutorials

This tutorial is applicable to Mutect2 version 4.1.1.0 and higher. Post suggestions/questions in the Ask the GATK team section.


GATK 4.1.1.0 contains several major advances in the Mutect2 pipeline, which is good, but we have had to change command lines in a few places, which we usually try to avoid. Here are the major changes.

Bug Fixes

We fixed several bugs with error messages about invalid log probabilities, infinities, NaNs etc that were introduced by neglecting to account for finite precision errors. We also resolved an issue where CalculateContamination worked poorly on very small gene panels.

New Required Inputs to FilterMutectCalls

1) FilterMutectCalls now requires a reference, which should be the same fasta file input to Mutect2.
2) FilterMutectCalls also requires a stats file from Mutect2. When you run Mutect2 with output -O unfiltered.vcf, for example, it produces a file called unfiltered.vcf.stats. FilterMutectCalls automatically looks for this file, so if you are running locally no change is needed. That is,

gatk Mutect2 -R ref.fasta -I tumor.bam -O unfiltered.vcf

followed by

gatk FilterMutectCalls -R ref.fasta -V unfiltered.vcf -O filtered.vcf

works because Mutect2 creates unfiltered.vcf.stats behind the scenes and FilterMutectCalls knows to look for it. However, if you are running on a cluster or the cloud you need to keep track of the stats file. For example, you need to delocalize it from a VM, as is done in the Mutect2 WDL. You can explicitly input the stats file with the -stats argument in FilterMutectCalls. If you are scattering Mutect2 over multiple nodes you must merge the stats files with MergeMutectStats:

gatk MergeMutectStats -stats unfiltered1.vcf.stats -stats unfiltered2.vcf.stats -O merged.stats

and pass merged.stats to FilterMutectCalls.

New Filtering Strategy in FilterMutectCalls

FilterMutectCalls now filters based on a single quantity, the probability that a variant is a somatic mutation. Previously, different reasons why this might not be the case each had their own thresholds. We have removed parameters such as -normal-artifact-lod, -max-germline-posterior, -max-strand-artifact-probability, and -max-contamination-probability. Even the previously fundamental -tumor-lod is gone. Rather than replace these with a single threshold for the error probability, FilterMutectCalls replaces them with nothing at all. It automatically determines the threshold that optimizes the "F score", the harmonic mean of sensitivity and precision. In order to tweak results in favor of more sensitivity users may set -f-score-beta to a value greater than its default of 1 (beta is the relative weight of sensitivity versus precision in the harmonic mean). Setting it lower biases results toward greater precision.

You can think of these changes as doing two things. Unifying filtering thresholds puts all the filters at the same place on a ROC curve of sensitivity vs precision. Previously, one threshold might be sacrificing a lot of sensitivity for a small gain in precision while another might be doing the opposite, the result being poor sensitivity and precision that fell below the potential ROC curve. Once we’re on that ROC curve, we achieve a good balance between sensitivity and precision by optimizing the F score.

New Somatic Clustering Model in FilterMutectCalls

We had long suspected that modeling the spectrum of subclonal allele fractions would help distinguish somatic variants from errors. For example, if every somatic variant in a tumor were a het occurring in 40% of cells, we would know to reject anything with an allele fraction significantly different from 20%. In the Bayesian framework of Mutect2 this means that the likelihood for somatic variants is given by a binomial distribution. We account for an unknown number of subclones with a Dirichlet process binomial mixture model. This is still an oversimplification because CNVs, small subclones, and genetic drift of passenger mutations all contribute allele fractions that don’t match a few discrete values. Therefore, we include a couple of beta-binomials in the mixture to account for a background spread of allele fractions while still benefiting from clustering. Finally, we use these binomial and beta-binomial likelihoods to refine the tumor log odds calculated by Mutect2, which assume a uniform distribution of allele fractions.

In addition to clustering allele fractions we also learn the overall prior probabilities of somatic SNVs and indels so that we can be more skeptical of apparent variants in a quiet tumor, for example. We learn the parameters of this model with a stochastic EM approach, where the E steps consist of Chinese Restaurant Process sampling of the allele fraction clusters. In case you were wondering, we have tested this new approach on some of the off-label, non-cancer uses of Mutect2, such as mitochondria, and it works very well.
FilterMutectCalls reports the learned parameters of somatic clustering in a new .filtering_stats.tsv output. This file also contains information on the probability threshold chosen to optimize the F score and the number of false positives and false negatives from each filter to be expected from this choice.

A step-by-step guide to the new Mutect2 Read Orientation Artifacts Workflow

If you suspect any of your samples of substitution errors that occur on a single strand before sequencing you should definitely use Mutect2's orientation bias filter. This applies to all FFPE tumor samples and samples sequenced on Illumina Novaseq machines, among others. In fact, with the optimizations in 4.1.1.0 you can run the filter even when you're not suspicious. It won't hurt accuracy and the CPU cost is now quite small.

There are three steps to the filter. First, run Mutect2 with the --f1r2-tar-gz argument. This creates an output with raw data used to learn the orientation bias model. Previously this was done by CollectF1R2Counts. By absorbing it into Mutect2, we eliminated the cost of CollectF1R2Counts with almost no effect on the runtime of Mutect2. When multiple tumor samples are specified, you only need a single --f1r2-tar-gz output, which contains data for each tumor sample.

gatk Mutect2 -R ref.fasta \
   -L intervals.interval_list \
   -I tumor.bam \
   -germline-resource af-only-gnomad.vcf \
   -pon panel_of_normals.vcf   \
   --f1r2-tar-gz f1r2.tar.gz \
   -O unfiltered.vcf

Next, pass this raw data to LearnReadOrientationModel:

gatk LearnReadOrientationModel -I f1r2.tar.gz -O read-orientation-model.tar.gz

Run GetPileupSummaries to summarize read support for a set number of known variant sites.

gatk GetPileupSummaries \
    -I tumor.bam \
    -V chr17_small_exac_common_3_grch38.vcf.gz \
    -L chr17_small_exac_common_3_grch38.vcf.gz \
    -O getpileupsummaries.table

Estimate contamination with CalculateContamination.

gatk CalculateContamination \
    -I getpileupsummaries.table \
    -tumor-segmentation segments.table \
    -O calculatecontamination.table

Finally, pass the learned read orientation model to FilterMutectCallswith the -ob-priors argument:

gatk FilterMutectCalls -V unfiltered.vcf \
   [--tumor-segmentation segments.table] \
   [--contamination-table contamination.table] \
   --ob-priors read-orientation-model.tar.gz \
   -O filtered.vcf

Advanced note: if you are scattering Mutect2 over nodes in a cluster or on the cloud, you must input the --f1r2-tar-gz output from each Mutect2 scatter to LearnReadOrientationModel. This is done automatically in the Mutect2 wdl in the gatk repo and on Terra. For example:

for chromosome in {1..22}; do
gatk Mutect2 -R ref.fasta -I tumor.bam -L $chromosome --f1r2-tar-gz ${chromosome}-f1r2.tar.gz -O ${chromosome}-unfiltered.vcf``
done
all_f1r2_input=`for chromosome in {1..22}; do printf -- "-I ${chromosome}-f1r2.tar.gz "; done`
LearnReadOrientationModel $all_f1_r2_input -O read-orientation-model.tar.gz

A step-by-step guide to the new Mutect2 Panel of Normals Workflow

We rewrote CreateSomaticPanelOfNormals to use GenomicsDB for scalability. We also added the INFO fields FRACTION (the fraction of normal samples with an artifact) and BETA (the shape parameters of the beta distribution of artifact allele fractions among samples with the artifact) to the panel of normals. FilterMutectCalls doesn’t use these yet, but we hope to experiment with them in the near future. Furthermore, we never liked how germline variants, which are handled in a more principled way with our germline filter, ended up as hard-filtered pon sites, so the panel of normals workflow now optionally excludes germline events from its output, keeping only technical artifacts.

The three steps to create a panel of normals are:

Step 1: Run Mutect2 in tumor-only mode for each normal sample:

gatk Mutect2 -R reference.fasta -I normal1.bam   --max-mnp-distance 0 -O normal1.vcf.gz
gatk Mutect2 -R reference.fasta -I normal2.bam   --max-mnp-distance 0 -O normal2.vcf.gz
Etc

Step 2: Create a GenomicsDB from the normal Mutect2 calls:

gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list \
  --genomicsdb-workspace-path pon_db \
  -V normal1.vcf.gz \
  -V normal2.vcf.gz \
  -V normal3.vcf.gz

Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:

gatk CreateSomaticPanelOfNormals -R reference.fasta \
  --germline-resource af-only-gnomad.vcf.gz \
  -V gendb://pon_db \
  -O pon.vcf.gz

In the third step, the AF-only gnomAD resource is the same public GATK resource used by Mutect2 (when calling tumor samples, but not in Step 1, above).

Post edited by bhanuGandham on

Comments

  • yyjyyj Member
    Hi,@bhanuGandham

    gatk Mutect2 -R ref.fasta \
    -L intervals.interval_list \
    -I tumor.bam \
    -germline-resource af-only-gnomad.vcf \
    -pon panel_of_normals.vcf \
    --f1r2-tar-gz f1r2.tar.gz \
    -O unfiltered.vcf

    Is this new command line for tumor-only mode?
    If this is,How about Tumor with matched normal?
  • polymerasepolymerase ShanghaiMember

    Hi,@bhanuGandham

    When I try step 2 to create PON, there is a mistake, “A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at chr1:889158 in VCF”.

    How to fix this?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @polymerase

    We have updated the tutorial with GenomicsDBImport command to exclude MNPs using --max-mnp-distance 0. Updated command as below:

    gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list --genomicsdb-workspace-path pon_db --max-mnp-distance 0 -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz

    This will fix the error.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
  • alanhoylealanhoyle UNC Lineberger Member

    How does one source/generate the af-only-gnomad.vcf for the PON creation and mutect2 runs?

  • ismael_Pismael_P Member
    Hello,

    The documentation doesn't say it but MergeMutectStats can take a .list file as input if you have a long list of .stats file.
    I think you should describe this optionin the help as it is helpfull.
  • anamariaanamaria Member

    Even though it is specified in the post header that "This tutorial is applicable to Mutect2 version 4.1.1.0 and higher", it appears that MergeMutectStats does not exist in version 4.1.0.0. Please correct me if I'm wrong.

  • anamariaanamaria Member

    I just realised this makes no sense and I got the versions mixed up, apologies.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
  • dcampodcampo Los AngelesMember

    Hello!

    I am trying to run GenomicsDB, but after 12 h, it exits with the following error:

    terminate called after throwing an instance of 'GenomicsDBConfigException'
    what(): GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_2075782628441988593.json

    This is my command line:

    java -jar $GATK GenomicsDBImport \
    -R $REFERENCE/GRCh38.p7.genome.fa \
    -L $REFERENCE/MedExome_hg38_capture_targets.bed \
    --genomicsdb-workspace-path $VCF/PON/PON_db \
    --sample-name-map $VCF/PON/normals_for_pon_map \
    --reader-threads 16 \
    --verbosity ERROR

    These are 5 exome samples, and I'm running on one node with 16 CPUs and 64 GB of RAM in our local cluster.
    Also, the tool creates 17,400 directories, all starting with chr1. Does that mean that it is still working on the variants in chr1?? Because if so, even if I make it run, it will take days...am I doing something wrong?

    Thanks!

  • dcampodcampo Los AngelesMember

    Hi,

    an update on my comment about "Syntax error in JSON file /tmp/loader_2075782628441988593.json". I specified a different temp folder, and I am not getting that error anymore. I guess the default temp folder was filling up quickly.

    Anyways, I have now the problem with run time. After 24 h, the tool creates thousands of directories, but it's still working on chr 2, so it would take weeks to finish. And our university cluster has a wall time limit of 24 h...
    My command line is the same than I posted above, except with the added "--tmp-dir=$SCRATCHDIR" option.
    I don't think this is supposed to take that long, so I assume I am doing something wrong, but can't figure out what...any help, please?

    Thanks!

  • JulsJuls Member ✭✭
    edited July 1

    Hi,

    @ New Filtering Strategy in FilterMutectCalls

    Is there a description available for the (remaining) filters?

    These are the new / edited ones, correct?
    weak_evidence
    slippage
    haplotype
    germline

    Thanks!

  • jcx9dyjcx9dy Member

    @dcampo I have the same issue... would love to know if you figure it out.

    After bouncing through the forums for a while, these are the links I've scrounged up:

    Some unorganized thoughts:

    • filesystem i/o read/write could really be slowing things down (these issues become more apparent on a cluster, as opposed to cloud instances that abstract away the shared resource limitations)
    • one recommendation was to scatter/parallelize the db import over many intervals (but still all samples, though perhaps with the --batch 50 option); subsequently merge the interval-specific pooled vcf somehow (how to properly do this? .. in the firecloud example?)
    • another last resort is to switch back to the old system of using CombineGVCFs which is similar to what Queue used to use?

    I vaguely remember using Queue (scala workflow instead of wdl) to first scatter intervals and samples for mutect2 tumor-only mode for normals.bam, and the final step was calling CombineVariants to yield the PON.vcf This last merge step was very fast (~10 hours total for ~40 WGS). Something must've drastically changed about the more recent merging workflows if people report needing hundreds of hours.

  • RishabhRishabh IndiaMember
    @bhanuGandham
    I am trying to Create a GenomicsDB from the normal Mutect2 calls
    So there are four command that i used and it pop out with different errors but now m i struck .
    1. Creating GBImport
    java -jar /PathToGatk GenomicsDBImport -R Hg19_Reference.fa -L V6.bed -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz -V normal4.vcf.gz -V normal5.vcf.gz --genomicsdb-workspace-path pon_db
    A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs.

    2. Removing MNP from vcf.gz files
    java -jar /PathToGatk SelectVariants -R Hg19_Reference.fa --variant normal1.vcf.gz -O normal1_without_mnp.vcf.gz --xl-select-type MNP

    3. Creating GBImport without MNP in vcf.gz files
    java -jar /PathToGatk GenomicsDBImport -R Hg19_Reference.fa -L V6.bed -V normal1_without_mnp.vcf.gz -V normal2_without_mnp.vcf.gz -V normal3_without_mnp.vcf.gz -V normal4_without_mnp.vcf.gz -V normal5_without_mnp.vcf.gz --genomicsdb-workspace-path pon_db
    Duplicate field name AF found in vid attribute "fields"
    terminate called after throwing an instance of 'FileBasedVidMapperException'
    what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute "fields"
    Aborted

    4. Using --max-mnp-distance 0 option to create GBImport
    java -jar /PathToGatk GenomicsDBImport -R Hg19_Reference.fa -L V6.bed -V normal1_without_mnp.vcf.gz -V normal2_without_mnp.vcf.gz -V normal3_without_mnp.vcf.gz -V normal4_without_mnp.vcf.gz -V normal5_without_mnp.vcf.gz --genomicsdb-workspace-path pon_db --max-mnp-distance 0
    A USER ERROR has occurred: max-mnp-distance is not a recognized option
  • jcx9dyjcx9dy Member

    @Rishabh

    4.) this page has not been corrected yet (one of the admins would need to move the --max-mnp-distance 0 to step1 instead of step2 in the main post); but from a comment in this section, the --max-mnp-distance 0 is an option for Mutect2, not for GenomicsDBImport. So if you want to use this, you should have used it when creating the normal*.vcf.gz

    I'm not sure what (3.) is ... and am also not sure that (2.) will be sufficient. You could keep researching (3.) or, the easier/safer option (which is part of the established recommendations) is to just run Mutect2 with the aforementioned parameter to remove MNP.

  • alanhoylealanhoyle UNC Lineberger Member

    Re: PON generation:

    For any genome-wide interval sets, I think you're also going to want to run GenomicsDBImport with the --merge-input-intervals option.

    Were running this on TCGA normals that we aligned to the GDC's GRCh38.d1.vd1 reference with BWA, then sorted/markdup before running through the Mutect2 artifact generation, then using Gencode exons for our --intervals.

    We're still doing testing, but a colleague found the previous version (i.e. without the GenomicsDB) to run far faster when making the PON.

  • dcampodcampo Los AngelesMember
    edited July 9

    @jcx9dy
    I did not have time to troubleshoot, so I ended up installing a previous version of GATK to avoid GenomicsDBImport. At some point I will have to come back to this and play around a bit with parallelizing, etc.
    That said, I don't believe it is a matter of run time. Like I said above. these are only 5 exome samples; if after 24 h running, the tool is still working on chr2, it means that the whole thing will take weeks to complete. It has to be something else...
    So, any help on this from the team will be greatly appreciated!
    Thanks!

  • alanhoylealanhoyle UNC Lineberger Member

    Just as a follow-up: combining 50 TCGA normals (600k-4m variants per artifact VCF; 75.4 million data rows total) resulted in a 108 MB VCF with 7.1 million data rows (i.e. not headers). GenomicsDBImport of the 50 VCFs took 164 minutes (2.7 hr) and created a 2.8 GB pon_db, CreateSomaticPanelOfNormals took 155 minutes (2.6 hr). That's over 6 hours on some fast machines.

    If one neglects the --merge-input-intervals option, I think it runs vastly slower and outputs many more error messages in part 2.

    Are these the kinds of numbers we should expect?

  • alanhoylealanhoyle UNC Lineberger Member

    @dcampo try gatk GenomicsDBImport --merge-input-intervals ... that brought the time down to something that ran in 5-7 hours for us, but that is still far longer than I would have expected.

  • tonytony Member ✭✭
    edited July 15

    Hi @bhanuGandham

    Regarding pon creation only,

    1. we must not provide --germline-resource in Mutect2 call of Step1 ?
    2. --max-mnp-distance 0 has to be set in Mutect2 call of Step1 ?

    Many thanks,
    Anthony

  • alanhoylealanhoyle UNC Lineberger Member

    @alanhoyle said:
    @dcampo try gatk GenomicsDBImport --merge-input-intervals ... that brought the time down to something that ran in 5-7 hours for us, but that is still far longer than I would have expected.

    Running the same VCFS through the older CreateSomaticPanelOfNormals runs in ~5 minutes, but the PON.vcf doesn't contain the same information. One hopes that this is worth it?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @shenglai

    it seems max-mnp-distance should be added to the Mutect2 tumor only calling. I just wanna confirm if I understand it correctly with you

    You are right and thank you for bringing it to our notice. We have corrected this error.

  • dcampodcampo Los AngelesMember

    @alanhoyle
    So I tried re-running with the --merge-input-intervals option, and it ran in literally 1 min...
    From weeks to 1 min!!!
    That option should really be emphasized in the tool description, or maybe even included by default.
    Thanks a lot!

  • syer89syer89 Member
    edited July 26
    Hi @davidben and others, wondering if there a limitation on indel calling with Mutect2. Although it picked some validated INDELS (15bp del, 33bp ins, etc..), I have an 88bp deletion it did not. It was previously picked by HC (v3.2-) at ~14% and with v4.1.2 as well at ~9.75%. I understand there will be differences between the two callers, but it is real and not present in PON to be filtered out. Just want to make sure if this is an expected outcome (like the indel length etc..)
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @syer89 indels of that size are towards the limit of what Mutect2 can handle. There's no fixed limit, just that our default assembly parameters start missing things around that size.

Sign In or Register to comment.