We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

(How to) Call somatic mutations using GATK4 Mutect2 (Deprecated)

shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
edited May 2019 in Tutorials

This tutorial is now deprecated and only valid for Mutect2 v4.1.0.0 and lower. For Mutect2 v4.1.1.0 and higher, please refer to this tutorial.

Post suggestions and read about updates in the Comments section.

imageThis tutorial introduces researchers to considerations in somatic short variant discovery using GATK4 Mutect2. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood and are aligned to GRCh38 with post-alt processing [1]. The tutorial focuses on how to call traditional somatic short mutations, as described in Article#11127 and pipelined in GATK v4.0.0.0's mutect2.wdl [2]. The tool and its workflow are in BETA status as of this writing, which means they may undergo changes and are not guaranteed for production.

► For Broad Mutation Calling Best Practices, see FireCloud Article#45055.

Section 1 calls somatic mutations with Mutect2 using all the bells and whistles of the tool. Section 2 outlines how to create the panel of normals resource using the tumor-only mode of Mutect2. Section 3 outlines how to estimate cross-sample contamination. Section 4 shows how to filter the callset with FilterMutectCalls. Unlike GATK3, in GATK4 the somatic calling and filtering functionalities are embodied by separate tools. Section 5 shows an optional filtering step to filter by sequence context artifacts that present with orientation bias, e.g. OxoG artifacts. Section 6 shows how to set up in IGV for manual review. Finally, section 7 provides a brief list of related resources that may be of interest to researchers.

GATK4 Mutect2 is a versatile variant caller that not only is more sensitive than, but is also roughly twice as fast as, HaplotypeCaller's reference confidence mode. Researchers who wish to customize analyses should find the tutorial's descriptions of the multiple levers of Mutect2 in section 1 and descriptions of the tumor-only mode of Mutect2 in section 2 of interest.

Jump to a section

  1. Call somatic short variants and generate a bamout with Mutect2
    1.1 What are the Mutect2 annotations?
    1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?
  2. Create a sites-only PoN with CreateSomaticPanelOfNormals
    2.1 The tumor-only mode of Mutect2 is useful outside of pon creation
  3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination
    3.1 What if I find high levels of contamination?
  4. Filter for confident somatic calls using FilterMutectCalls
  5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias
    5.1 Tally of applied filters for the tutorial data
  6. Set up in IGV to review somatic calls
  7. Related resources

Tools involved

  • GATK v4.0.0.0 is available in a Docker image and as a standalone jar. For the latest release, see the Downloads page. Note that GATK v4.0.0.0 contains Picard tools from release v2.17.2 that are callable with the gatk launch script.
  • Desktop IGV. The tutorial uses v2.3.97.

Download example data

Download tutorial_11136.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data and resources, see [3] and [4].

► The tutorial steps switch between the subset and full data. Some of the data files, e.g. BAMs, are restricted to a small region of the genome to efficiently pace the tutorial. Other files, e.g. the Mutect2 calls that the tutorial filters, are from the entire genome. The tutorial content was originally developed for the 2017-09 Helsinki workshop and we make the full data files, i.e. the resource files and the BAMs, available at gs://gatk-best-practices/somatic-hg38.

1. Call somatic short variants and generate a bamout with Mutect2

Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see Article#11127. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.

gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam 

This produces a raw unfiltered somatic callset 1_somatic_m2.vcf.gz, a reassembled reads BAM 2_tumor_normal_m2.bam and the respective indices 1_somatic_m2.vcf.gz.tbi and 2_tumor_normal_m2.bai.

Comments on select parameters

  • Specify the case sample for somatic calling with two parameters. Provide the BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName. Alternatively, use samtools view -H tumor.bam | grep '@RG'.
  • Prefilter variant sites in a control sample alignment. Specify the control BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.
  • Prefilter variant sites in a panel of normals callset. Specify the panel of normals (PoN) VCF with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it presents commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool does not reassemble nor emit variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.
  • Annotate variant alleles by specifying a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025 which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.
  • Include reads whose mate maps to a different contig. For our somatic analysis that uses alt-aware and post-alt processed alignments to GRCh38, we disable a specific read filter with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (in alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.
  • Target the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.
  • Generate the reassembled alignments file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.

To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.


We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.

☞ 1.1 What are the Mutect2 annotations?

We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.



The Variant Annotations section of the Tool Documentation further describe some of the annotations. For a complete list of annotations available in GATK4, see this site.

To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.

☞ 1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?

To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter (MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigs and read pairs that span contigs can now lend support to variant calls.

For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls comes from the ALT, HLA and decoy contigs.

back to top

2. Create a sites-only PoN with CreateSomaticPanelOfNormals

We make the motions of creating a PoN using three germline samples. These samples are HG00190, NA19771 and HG02759 [3].

First, run Mutect2 in tumor-only mode on each normal sample. In tumor-only mode, a single case sample is analyzed with the -tumor flag without an accompanying matched control -normal sample. For the tutorial, we run this command only for sample HG00190.

gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz

This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.


We see for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively for the two sites.

Comments on select parameters

  • One option that is not used here is to include a germline resource with --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af.
  • Recapitulate any special options used in somatic calling in the panel of normals sample calling, e.g.--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.

Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for panel of normals is a minimum of forty samples.

gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.
CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.


Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.

What do you think of including samples of family members in the PoN?

☞ 2.1 The tumor-only mode of Mutect2 is useful outside of pon creation

For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples. This approach is described in Blog#11315.

back to top

3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.

First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.

gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table

This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.


Comments on select parameters

  • The tool only considers homozygous alternate sites in the sample that have a population allele frequency that ranges between that set by --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
  • One option to speed up analysis, that is not used in the command above, is to limit data collection to a sufficiently large but subset genomic region with the -L argument.
  • As of GATK4.0.8.0, released August 2, 2018, GetPileupSummaries requires both -L and -V parameters. For the tutorial, provide the same resources/chr17_small_exac_common_3_grch38.vcf.gz file to each parameter. For details, see the GetPileupSummaries tool documentation.

Second, estimate contamination with CalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table

This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.


Comments on select parameters

  • CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument.

► Cross-sample contamination differs from normal contamination of tumor and tumor contamination of normal. Currently, the workflow does not account for the latter type of purity issue.

☞ 3.1 What if I find high levels of contamination?

One thing to rule out is sample swaps at the read group level.

Picard’s CrosscheckFingerprints can detect sample-swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual and (ii) check at the read group level that each of the read group data come from the same individual.

Again, imagine if we mistook the contaminating read group data as some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us All read groups related as expected.

back to top

4. Filter for confident somatic calls using FilterMutectCalls

FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.

Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.

gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz

This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.


This step seemingly applies 14 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.

So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.

► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.

back to top

5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.

First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Alternatively, use the tool from a standalone Picard jar.

java -jar picard.jar \
CollectSequencingArtifactMetrics \
I=tumor.bam \
O=10_tumor_artifact \

This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.



Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.

  • The TOTAL_QSCORE is Phred-scaled such that lower scores equate to a higher probability the change is artifactual. E.g. forty translates to 1 in 10,000 probability. For OxoG, a rough cutoff for concern is 30. FilterByOrientationBias uses the quality score as a prior that a context will produce an artifact. The tool also weighs the evidence from the reads. For example, if the QSCORE is 50 but the allele is supported by 15 reads in F1R2 and no reads in F2R1, then the tool should filter the call.
  • FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations. Oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. In GATK4 Mutect2, the F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.

Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.

gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz

This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.


Is the filtering in line with our earlier prediction?

In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.


☞ 5.1 Tally of applied filters for the tutorial data

The table shows the breakdown in filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallys the instances in which each filter was applied across the calls and the third column tallys the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).


Which filters appear to have the greatest impact? What types of calls do you think compels manual review?

Examine passing records with the following command. Take note of the AD and AF annotation values in particular, as they show the high sensitivity of the caller.

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less

back to top

6. Set up in IGV to review somatic calls

Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.

To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect that of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.

First, load Human (hg38) as the reference in IGV. Then load these six files in order:

  • resources/chr17_pon.vcf.gz
  • resources/chr17_af-only-gnomad_grch38.vcf.gz
  • 11_somatic_twicefiltered.vcf.gz
  • 2_tumor_normal_m2.bam
  • normal.bam
  • tumor.bam

With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.

imageSecond, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).

  • One of the tracks is dominating the view. Right-click on track chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.
  • imageZoom into the somatic call in 11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.
  • Hover over or click on the gray call in track 11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.
  • Scroll through the alignment data and notice the coverage for the samples.

A C→T variant is in tumor.bam but not normal.bam. What is happening in 2_tumor_normal_m2.bam?

imageThird, tweak IGV settings that aid in visualizing reassembled alignments.

  • Make room to focus on track 2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bam, normal.bam and their coverages. Right-click and Remove Tracks.
  • Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
  • Drag the alignments panel to center the red variant.
  • Right-click on the alignments track and

    • Group by sample
    • Sort by base
    • Color by tag: HC.
  • Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.


What are the three grouped tracks for the bamout? What does the pastel versus gray colors indicate? How plausible is it that all tumor copies of this locus have this alteration?

Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.


chr17 7,674,220 . C T . PASS DP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15
HCC1143_normal 0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false
HCC1143_tumor 0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946

Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.

chr17 4,539,344 T TA artifact_in_normal;germline_risk;panel_of_normals
chr17 7,221,420 CACTGCCCTAGGTCAGGA C artifact_in_normal;panel_of_normals;str_contraction
chr17 7,483,063 A AC mapping_quality;t_lod
chr17 8,513,688 GTT G panel_of_normals
chr17 19,748,387 G GA t_lod
chr17 26,982,033 G GC artifact_in_normal;clustered_events
chr17 30,059,463 CT C t_lod
chr17 35,422,473 C CA t_lod
chr17 35,671,734 CTT C,CT,CTTT artifact_in_normal;multiallelic;panel_of_normals
chr17 43,104,057 CA C artifact_in_normal;germline_risk;panel_of_normals
chr17 43,104,072 AAAAAAAAAGAAAAG A panel_of_normals;t_lod
chr17 46,332,538 G GT artifact_in_normal;panel_of_normals
chr17 47,157,394 CAA C panel_of_normals;t_lod
chr17 50,124,771 GCACACACACACACACA G clustered_events;panel_of_normals;t_lod
chr17 68,907,890 GA G artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
chr17 69,182,632 C CA artifact_in_normal;t_lod
chr17 69,182,835 GAAAA G panel_of_normals

back to top

7. Related resources

The next step after generating a carefully manicured somatic callset is typically functional annotation.

  • Funcotator is available in BETA and can annotate GRCh38 and prior reference aligned VCF format data.
  • Oncotator can annotate GRCh37 and prior reference aligned MAF and VCF format data. It is also possible to download and install the tool following instructions in Article#4154.
  • Annotate with the external program VEP to predict phenotypic changes and confirm or hypothesize biochemical effects.

For a cohort, after annotation, use MutSig to discover driver mutations. MutsigCV (the version is CV) is available on GenePattern. If more samples are needed to increase the power of the analysis, consider padding the analysis set with TCGA Project or other data.

The dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665 shows somatic copy number alterations for the HCC1143 tumor sample. Its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. Somatic copy number alteration detection will be covered in another GATK tutorial. For reference implementations of Somatic CNV workflows see here.

back to top


[1] Data was alt-aware aligned to GRCh38 and post-alt processed. For an introduction to alt-aware alignment and post-alt processing, see [Blog#8180](https://software.broadinstitute.org/gatk/blog?id=8180). The HCC1143 alignments are identical to that in [Tutorial#9183](https://software.broadinstitute.org/gatk/documentation/article?id=9183), which uses GATK3 MuTect2.

[2] For scripted GATK Best Practices Somatic Short Variant Discovery workflows, see [https://github.com/gatk-workflows](https://github.com/gatk-workflows). Within the repository, as of this writing, [gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels), which uses GRCh37, is the sole GATK4 Mutect2 workflow. This tutorial uses additional parameters not used in the [GRCh37 gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels) example because the tutorial data was preprocessed with post-alt processing of alt-aware alignments, which deviates from production practices. The general workflow steps remain the same.

[3] About the tutorial data:

  • The data tarball contains 15 files in the main directory, six files in its resources folder and twenty files in its precomputed folder. Of the files, chr17 refers to data subset to that in the regions in chr17plus.interval_list, the m2pon consists of forty 1000 Genomes Project samples, pon to panel of normals, tumor to the tumor HCC1143 breast cancer sample and normal to its matched blood normal.
  • Again, example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and known as HCC1143 and HCC1143_BL, respectively. The Broad Cancer Genome Analysis (CGA) group has graciously provided 2x76 paired-end whole exome sequence data from the two cell lines (C835.HCC1143_2 and C835.HCC1143_BL.4), and @shlee reverted and aligned these to GRCh38 using alt-aware alignment and post-alt processing as described in Tutorial#8017. During preprocessing, the MergeBamAlignment step was omitted, reads containing adapter sequence were removed altogether for both samples (~0.153% of reads in the tumor) as determined by MarkIlluminaAdapters, base qualities were not binned during base recalibration and indel realignment was included to match the toolchain of the PoN normals. The program group for base recalibration is absent from the BAM headers due to a bug in the version of PrintReads at the time of pre-processing, in January of 2017.
  • Note that the tutorial uses exome data for its small size. The workflow is applicable to whole genome sequence data (WGS).
  • @shlee lifted-over or remapped the gnomAD resource files from GRCh37 counterparts to GRCh38. The tutorial uses subsets of the full resources; the full-length versions are available at gs://gatk-best-practices/somatic-hg38/. The official GRCh37 versions of the resources are available in the GATK Resource Bundle and are based on the gnomAD resource. These GRCh37 versions were prepared by @davidben according to the method outlined in the mutect_resources.wdl and described in [4].
  • The full data in the tutorial were generated by @shlee using the github.com/broadinstitute/gatk mutect2.wdl from between the v4.0.0.0 and v4.0.0.1 release with commit hash b4d1ddd. The GATK Docker image was broadinstitute/gatk: and Picard was v2.14.1. A single modification was made to the script to enable generating the bamout. The script was run locally on a Google Cloud Compute VM using Cromwell v30.1. Given Docker was installed and the specified Docker images were present on the VM, Cromwell automatically launched local Docker container instances during the run and handled the local files as hard-links to avoid redundant copying. Workflow input variables were as follows.
  "Mutect2.is_run_oncotator": "False",
  "Mutect2.is_run_orientation_bias_filter": "True",
  "Mutect2.picard": "/home/shlee/picard-2.14.1.jar",
  "Mutect2.gatk_docker": "broadinstitute/gatk:",
  "Mutect2.oncotator_docker": "broadinstitute/oncotator:",
  "Mutect2.artifact_modes": ["G/T", "C/T"],
  "Mutect2.m2_extra_args": "--af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter",
  "Mutect2.m2_extra_filtering_args": "",
  "Mutect2.scatter_count": "10"
  • If using newer versions of the mutect2.wdl that allow setting SplitIntervals optional arguments, then @shlee recommends setting --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION to avoid splitting contigs.
  • With the exception of the PoN and Picard tool steps, data was generated using v4.0.0.0. The PoN was generated using GATK4 vbeta.6. Besides the syntax, little changed for the Mutect2 workflow between these releases and the workflow and most of its tools remain in beta status as of this writing. We used Picard v2.14.1 for the CollectSequencingArtifactMetrics step. Figures in section 5 reflect results from Picard v2.11.0, which give, at glance, identical results as 2.14.1.
  • The three samples in section 2 are present in the forty sample PoN used in section 1 and they are 1000 Genomes Project samples.

[4] The WDL script [mutect_resources.wdl](https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl) takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in _section 1_ and a common biallelic variants resource for use in _section 3_. The script first generates a sites-only VCF and in the process _removes all extraneous annotations_ except for `AF` allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF.

back to top

Post edited by bhanuGandham on


  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin

    I got an error (bash: gzcat: command not found) when running gzcat somatic_m2.vcf.gz | awk '$5 ~","' It worked when I replaced gzcat with zcat though.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited February 2018

    Hi @Tiffany_at_Broad. Yes, that's a difference in default utility tools available in UNIX vs. LINUX. We write these tutorials to be run on UNIX laptops. Thanks for clarifying for readers.

  • jeffmengjeffmeng BroadMember, Broadie

    @shlee I'm trying to run Mutect2 on FireCloud. Do I need a gatk_jar, or only a gatk_docker? What workspace attribute should I put down for gatk_docker? Thanks!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited February 2018

    Sorry @jeffmeng but I'm not familiar with FireCloud's requirements. I do know for the Mutect2 WDL that you need to specify these options in the JSON inputs file. If you do not want to use the Docker, you can remove the lines that mention them from the script. Let me have my colleague @KateN jump in here.

  • KateNKateN Cambridge, MAMember, Broadie, Moderator admin

    Hi @jeffmeng,

    Locally, @shlee is correct that you would specify these options in a JSON inputs file. As I'm sure you're aware, on FireCloud you specify the inputs either directly in a method configuration, or as workspace attributes.

    What you need here is a gatk_docker to run on FireCloud. For our own Mutect2 workspace (help-gatk/Somatic-SNVs-Indels-GATK4), we use the docker image us.gcr.io/broad-gatk/gatk. This is a GATK docker image hosted on GCR (Google Cloud Registry) that you can use in your own analyses. If you have any additional questions on using the M2 workflow in FireCloud, the documentation in the above-named workspace can be a great place to start. Otherwise, unless the question is on the logic of Mutect2 itself, I can help with FireCloud-related questions on the FireCloud forum.

  • Hi,

    I'm trying to create a PoN file. First, I ran the following command on each normal sample and created .vcf.gz and vcf.gz.tbi files.

    gatk Mutect2 \
    -R library/human_g1k_v37_decoy.fasta \
    -I GATK/BaseRecalibration/G0288_recal_reads.bam \
    -tumor G0288 \
    -L library/targets.interval_list \
    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
    -O GATK/PON/G0288.vcf.gz

    But I cannot collate my .vcf.gz files to create a final pon file. My code and the error message;

    gatk CreateSomaticPanelOfNormals \
    -vcfs GATK/PON/G0098.vcf.gz \
    -vcfs GATK/PON/G0288.vcf.gz \
    -vcfs GATK/PON/G5582.vcf.gz \
    -vcfs GATK/PON/G7287.vcf.gz \
    -vcfs GATK/PON/G9348.vcf.gz \
    -O GATK/PON/finalpon.vcf.gz

    [February 9, 2018 4:48:05 PM EET] org.broadinstitute.hellbender.tools.walkers.mutect.CreateSomaticPanelOfNormals done. Elapsed time: 0.00 minutes.
    htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to create BasicFeatureReader using feature file , for input source: file:///media/esra/Kucuk%20Lab/esra%20data%20analysis/GATK/PON/G0098.vcf.gz
    at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:113)
    at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:74)
    at htsjdk.variant.vcf.VCFFileReader.(VCFFileReader.java:117)
    at htsjdk.variant.vcf.VCFFileReader.(VCFFileReader.java:68)
    at org.broadinstitute.hellbender.tools.walkers.mutect.CreateSomaticPanelOfNormals.doWork(CreateSomaticPanelOfNormals.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:277)
    Caused by: java.io.FileNotFoundException: /media/esra/Kucuk%20Lab/esra%20data%20analysis/GATK/PON/G0098.vcf.gz (No such file or directory)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.(RandomAccessFile.java:243)
    at htsjdk.samtools.seekablestream.SeekableFileStream.(SeekableFileStream.java:47)
    at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:99)
    at htsjdk.tribble.readers.TabixReader.(TabixReader.java:129)
    at htsjdk.tribble.TabixFeatureReader.(TabixFeatureReader.java:80)
    at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:106)
    ... 10 more


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @esrabytk,

    I believe there was some bug in processing vcf.gz files that has since been fixed. Try the latest release or (depending on when you read this) OR try using uncompressed vcf files.

  • Hi,

    I used vcf files and final PoN file looks OK.


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    UPDATE: Mutect2 and FilterMutectCalls are in production status as of GATK v4.0.2.0.

  • KlausNZKlausNZ Member ✭✭

    Hi Team,
    Could you please clarify whether removing the paired flag (0x1) from alignments is required for T|N calling with Mutect2?
    Tutorial #11136 refers to #8017 for alt-aware mapping and processing, which does recommend 0x1 removal for HaplotypeCaller (section 6.1), but I'm unclear whether the same is required (or recommended) for Mutect2. Sorry I couldn't clarify this via mutect2.wdl (which commences after post-processing....)
    Many thanks in advance!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @KlausNZ,

    The MateOnSameContigOrNoMappedMateReadFilter is exposed in GATK4 for both Mutect2 and HaplotypeCaller. The filter is actually not at the engine level but employed by the assembly machinery. This was the filter I was trying to go around by removing the paired 0x1 flag in Tutorial#8017, which uses GATK3 HaplotypeCaller.

    So far as I know, for GATK3 this filter cannot be disabled and remains unexposed throughout the last versions.

    For GATK4, no, you do not need to remove the 0x1 flag, so long as you disable the MateOnSameContigOrNoMappedMateReadFilter. You can do this with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter.

    It is up to you to decide how you want to weight your supplementary alignment calls. I outlined these options because they have implications for calling on GRCh38 alt-aware mapped reads. I do not imply any requirement or recommendation. What is important is that your case sample and your PON match in the manner they were processed and called upon. Hope this helps clarify things.

  • KlausNZKlausNZ Member ✭✭

    Hi @shlee, many thanks for clarifying this!
    Could you provide some guidance about the minimum/optimum count of PON cases?
    Also, would you expect coverage differences (at otherwise identical technologies) between PON (100x) and tumour (300X) to cause problems?
    Many thanks!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited March 2018

    @KlausNZ, take a look at https://software.broadinstitute.org/gatk/documentation/article?id=11053 and https://software.broadinstitute.org/gatk/documentation/article?id=11127. The latter article explains how the PoN catches low-frequency systematic artifacts that may not occur for every sample, thus necessitating a panel of samples.

    I could imagine coverage differences could make a difference depending on how those coverages were achieved:

    • The pon samples and tumor sample started with the same amount of template DNA from which the differential coverage was achieved through additional rounds of PCR amplification for the tumor.
    • The tumor starts with more template DNA and the rounds of PCR amplification are equivalent to achieve the differential coverage.

    For the 100x and 300x coverage difference you cite, for either scenario above and assuming all else went well and equivalently for the sample preparations, the library complexity should be fairly equivalent. I don't see any problems, really for the Mutect2 workflow. Do you suspect any reason for concern?

  • KlausNZKlausNZ Member ✭✭

    Hi @shlee,
    Thanks for the pointer to #11053 (sorry I missed that) and yes, that sample number fits my expectations better than PON=3 in the example.
    Re coverage, the main rationale for the increase in tumour coverage is heterogeneity (unknown at the time of sequencing); we assume 300x would give us the targeted 100x tumour if as few as 33% of the cells in the sample are tumour cells. So we'd aim to start with a higher number of cells == more template DNA/equivalent rounds of PCR compared to the N sample.
    I didn't have specific concerns - just keen to hear your expert thoughts ;-)
    Thanks you very much again!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    I missed the recent appearance of Article#11053 in our documentation as well @KlausNZ. I just happened to find it by searching. I've amended the tutorial to refer to it in the PoN creation step. Thank you for bringing this up.

    I make tutorial data small on purpose so that the data bundles are easily downloadable and folks can complete the tutorials on their laptops within a reasonable amount of time and memory. I hope that's understood.

    You bring up another good point with tumor heterogeneity. We can add tumor purity to that, as tumor samples often have a proportion of normal representation. With TCGA matched samples, as well as the HCC1143 sample set in the current tutorial, the tumor sample has more coverage than the matched normal by design. I would think though that picking up artifacts of sequencing is independent of sample origin.

    I'm no expert--more of a student. Thanks for the shoutout though. = D

  • iremsarihaniremsarihan Member

    Things I've noticed while going through this tutorial:

    Resources folder does not have chr17_pon.vcf.gz pon file for the first part of the code (instead there is chr17_m2pon.vcf.gz)

    In CollectSequencingArtifactMetrics,
    –-FILE_EXTENSION ".txt" \ seems to be an invalid argument, instead -EXT ".txt" \ works

    In FilterByOrientationBias,
    -A G/T \
    -A C/T \ lines gives the error: A is not a recognized option. I guess it supposed to be -AM instead

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks, @iremsarihan for pointing these out. Can you share with us the versions of programs you are using? The tutorial uses GATK v4.0.0.0 and I must account for version differences in updates.

  • iremsarihaniremsarihan Member

    Yes sure @shlee , I was using GATK and Java version 1.8.0_151, maybe that could be the reason.
    Also, if I want to skip IGV part and view the results as a table, what would your suggestion be? I guess there is VariantsToTable but are there any other options for managing vcf files? Thanks!

    Issue · Github
    by shlee

    Issue Number
    Last Updated
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks, @iremsarihan. I'll look into the GATK minor version difference. The Java minor version should not impact results. As for viewing results outside of IGV, are you looking for ways to parse the results via more computational methods? If so, we have some Mutect2 WDL scripts that aid in a variety of validations, e.g. sensitivity and specificity calculations. They are at https://github.com/broadinstitute/gatk/tree/master/scripts/mutect2_wdl/unsupported. Notice these scripts are unsupported, which just means we do not support questions on them (we are a small team!). These validation scripts are used by our developers.

    Otherwise, perhaps you can look into whether GenomicsDBImport allows you to make a datastore of Mutect2 results. If not, and if you think this would really enable your research, then let us know and we can put in a feature request for such compatibility.

    I hope this answers your question.

  • iremsarihaniremsarihan Member

    Thank you @shlee for the links and your help. I've another thing to ask, if you don't mind. I am using Mutect2 on some matched tumor vs normal samples that were aligned with hg19.
    I've created the unfiltered vcf, but would like to filter it eventually. I am stuck at GetPileupSummaries step as I can't provide a common germline variant sites file. I found ''common_all.vcf.gz'' file from this link below but its ref is the newest genome and also does not have AF in its header therefore creating an error. Thanks for your time and support.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited March 2018

    Hi @iremsarihan,

    We provide a common germline variants file in the GATK Bundle. The b37 version is available on both our Google Cloud bucket and our FTP server. You can find links to both at https://software.broadinstitute.org/gatk/download/bundle. In the FTP site, navigate to Mutect2/GetPileupSummaries/. The file is called small_exac_common_3. You will see both b37 and hg38 versions. For hg19, you will have to liftover from the b37 version using a UCSC chain file and Picard LiftOverVcf.

  • Adam_U0Adam_U0 Member

    Dear Shlee,

    I have 12 samples - tumor WES - sorted bam's, DNA sequenced from tumor tissues
    and one normal sample, also WES, however DNA from blood has been sequenced.
    Is it possible to call somatic variants with this data?
    I'm a bit confused because you used several normal samples during PoN creation.

    Best regards,

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Adam_U0,

    Your setup is not clear to me. Do you have tumor matched normals for each tumor sample? If so, then you can call somatic variants using the matched-normal mode. We highly recommend the use of a matched normal, as otherwise, the workflow will call many germline variants, even with the germline resource and PoN.

    For the matched normal, matched blood is preferred unless the tumor sample is a blood cancer. Tumor contamination in normal is not something our workflow handles. So it's best for the matched normal be obtained from a tissue source where there will be no tumor contamination. Otherwise, you should check out the tools the Broad CGA offers.

    I think you will find https://software.broadinstitute.org/gatk/documentation/article?id=11127 helpful.

  • Adam_U0Adam_U0 Member

    Dear @shlee,

    I have 13 patients sequenced (.bam files, same tumor, not blood cancer), 12 of them are tumor samples and in case of 13rd patient I have both tumor and blood sequenced. I'd like to call somatic variants in those 12 tumor samples using blood sample. So I have 12 tumors and 1 normal and unfortunately from different patient. I'm afraid that I don't have normal-tumor for each patient.

    I'd like to know your opinion if gatk can handle this set of data.

    Best regatds,

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Adam_U0,

    Certainly, our GATK4 somatic calling workflows allow you to process tumors sans matched normal.
    First I would like for you to check out a poster in our posters folders. A link to the posters folder can be found on https://software.broadinstitute.org/gatk/documentation/presentations. Take a look at either 2017-ASHG_GATK4_new_functionalities.pdf or 2017-BR-GATK4-new-functionalities.pdf for an overview of GATK4 somatic workflows.

    You should be analyzing your tumor samples for (i) CNVs and (ii) short mutations.

  • Adam_U0Adam_U0 Member

    Dear @shlee,

    thank you so much for the comprehensive answer. By and large I'm interested in only in somatic short mutations, not even indels but SNPs mostly. I'll try t odo my analysis with my one normal sample.

    Best regards,

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi Adam (@Adam_U0),

    I'm glad we answered your question. I do hope you come back to CNVs. Aneuploidy is common in cancers and if you are looking for driver mutations within a cohort, it is essential to account for CNVs together with the short mutations. This is why I went to the trouble of obtaining and including the following slide in our workshop presentation on somatic mutations.

    In this plot, for the given brain cancer cohort, alteration types with yellow labels are organized per patient sample. We have mutation rate & significance, clinical parameters, copy-number gain and loss, and so on. For each alteration type, data is organized vertically by genes or chromosomal foci. This dataset reveals that patients without the characteristic IDH, TP53, and ATRX mutations have increased copy-number alterations--both gains and losses. This illustrates a cancer type (as we classify it) can have different underlying genomic alterations. And so cancer analysis must take into account the wider spectrum of alterations to identify driver events.

    You can view similar plots for different TCGA cancer cohorts at http://firebrowse.org/iCoMut.

  • ShishiMinShishiMin Member

    Hi @shlee, I want to know whether this pipline of call somatic mutations using GATK4 Mutect2 is suitable for single cell sequencing data? I have 96 single cell data of 16 tumor sample, 16 bulk cell WES data of the same 16 tumor sample , and 16 bulk cell WES data of the mached normal. I am confused how to set the
    -I normal.bam option. I don't konw whether to choose the WES data of the same 16 tumor sample or WES data of the mached normal as the normal.bam.
    Looking forward to your reply.Thanks!

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @ShishiMin,

    For somatic calling the normal should be your matched normal. This enables Mutect2 to filter against variant sites in the normal to exclude germline variant sites in your final callset.

    Just out of curiosity, how will you handle your single cell data? Will you pool them?

  • ShishiMinShishiMin Member

    Hi @shlee, I want to identify the true somatic mutations in my single cell data.In the analysis of single cell DNA sequencing data, SNVs not present in the bulk alignment are designated as true somatic SNVs rather than SNPs. The analysis need the matched bulk bam files rather than the nomal files, so I think the parameter -I in the GATK Mutect2 may be set with the bulk tissue bam files(that is bulk cell WES data of the same 16 tumor sample). I don't know if my understanding is wrong.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @ShishiMin,

    Let me see if I understand your setup. You have

    • 96 scDNA x 16 tumor case samples
    • 16 bulk exome data x same 16 tumor cases
    • 16 bulk exome data x 16 matched normal controls

    Based on

    identify the true somatic mutations in my single cell data

    It seems you are interested in tumor heterogeneity. I think you may be interested in recent discussions on this thread related to merging different libraries per sample. Other somatic short variant resources are listed here. Remember that you should also consider CNVs. Depending on the coverage you get for your scDNA samples, I imagine you will want to do a two-pronged analysis--per sample and per pool. Seems an interesting use case for us to support, so I'd like to get our developer, @davidben to join the discussion.

    In the meanwhile, would you mind sharing more about your experimental approach and what you expect to see? We can take this discussion offline if you want.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    A lot will depend on the coverage. Also, the upcoming M2 GGA mode aka force calling, could be useful here.

  • EADGEADG KielMember ✭✭✭

    Hi @shlee,
    thank you at first again for another nice tutorial. I Have some question about the Picard’s CrosscheckFingerprints.

    In the doc`s the commandline example mentioned a HAPLOTYPE_DATABASE I assume that corrospond to HAPLOTYPE_MAP :) ?

    My second question is how I get/create that haplotype_map, is there a source or a tutorial how to create it, the doc´s metioned the format but not how to create it.

    Thank you in advance,

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited April 2018

    Hi @EADG,

    I detail the haplotype map format here. Also, I think you'll find this discussion helpful.

    P.S. Thanks for liking this tutorial! I put a lot of effort into developing and writing these and it means a lot to me to get thanks from researchers.

  • iremsarihaniremsarihan Member

    Hi @shlee,
    If I am running Mutect2 with same samples but different parameters, do I really need to repeat the step 5 CollectSequencingArtifactMetrics? It looks like it's not dependent on the previous filtering step but as you emphasized this on the tutorial I wanted to check.
    Another question is do you have any documents for default parameters to run in tumor only mode? When I set af-of-alleles-not-in-resource = 0.001, I get no variants (GATK When I set it to 0 (as recommended in this discussion: https://gatkforums.broadinstitute.org/gatk/discussion/10157/gatk4-beta-no-filter-passing-variants-in-mutect2-tumor-only-runs-using-default-parameters) I get over 3000 variants, so I am trying to find what is best way to determine this number for tumor only mode.
    Last but not least thanks a lot for this tutorial and documentation. I keep coming back and reading this as its loaded with information.

    Issue · Github
    by shlee

    Issue Number
    Last Updated
    Closed By
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited May 2018

    Hi @iremsarihan,

    That depends on the version of Picard you ran previously and the sequence contexts you will use to filter. Sometime between my writing the January-2017 GATK3 MuTect2 tutorial (the first iteration) and September-2017 Helsinki Mutect2 tutorial, there was an update to Picard CollectSeuencingArtifactMetrics that give the following different results for the identical BAM file.

    older Picard CollectSeuencingArtifactMetrics

    newer Picard CollectSeuencingArtifactMetrics

    Notice the difference in the G-->C context. I assume the newer Picard fixed some sort of bug. Sorry, I forget the exact version of Picard that changes results. So I'd rerun this QC analysis with the latest Picard if your original runs were from a while back.

    As for your af-of-alleles-not-in-resource question, let's consult @davidben as I haven't followed the particular discussion thread you link.

    Post edited by shlee on
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Actually, (at glance) it looks like @Sheila is heavily involved in the thread, so let's pull her in here also.

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    Hi Irem,

    Can you post the exact command you are running? Have you tried setting --af_of_alleles_not_in_resource 1/400,000 if you are inputting a germline resource? Does that make a difference?


  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    FYI, In the most recent release,, Mutect2 automatically chooses different defaults depending on whether there is a matched normal.

  • iremsarihaniremsarihan Member

    Hi @Sheila,
    My exact command line was the same I used for tumors with matched normal, just without a matched sample -posted below. For pon and common germline variants I am using the files from resource bundle. af-of-alleles-not-in-resource 0.001 leaves me with no variants so I didn't try 1/400,000. I also tried FilterMutectCalls with --max-germline-posterior 0.5, but still nothing. When I change af-of-alleles-not-in-resource 0, I get around 3000 variants per sample.

    gatk Mutect2            
    -R /path/to/Homo_sapiens_assembly19.fasta           
    -I /path/to/RTF1.bam -tumor RTF1    
    -pon path/to/refseq_exome_10bp_hg19_300_1kg_normal_panel.vcf                    
    --germline-resource /path/to/common_germline_variants/af-only-gnomad.raw.sites.b37.vcf.gz           
    --af-of-alleles-not-in-resource 0.001           
    -O /path/to/outputs/RTF1_somatic_m2.vcf.gz 

    I am using, as it is the one installed in our high comp.cluster, but I can try locally with if it will make a difference, but the problem I have is with tumor only mode.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    Setting af-of-alleles-not-in-resource to 0 is reasonable as it basically turns off germline filtering of variants that aren't in your resource. If you have a low-purity tumor, however, you are better off using 1/400,000 because that will have a chance of filtering some germline hets while keeping the somatic variants with lower allele fractions.

  • sergey_ko13sergey_ko13 Member

    Dear All,
    thank You for excellent howto and discussion!
    my questions are rather about pre-processing
    1) running mutect2 on 40 T/N pairs - would we benefit from BQSR of all samples within one BAM first and later devide it to tumor.bam and normal.bam?
    2) I understand we need to clculate contamination table on tumor.bam only, but are there other reasons to use a separate tumor.bam instead of keeping all alignments in one 'MY.BAM' specifying only -tumor and -normal readgroup's SMs when we need?


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @sergey_ko13,

    BQSR is part of the Preprocessing Best Practices for both germline and somatic pipelines.

    GATK is read group aware. This means that for tools that differentiate read group elements or take in sample name(s) as a parameter(s), you can have tumor and normal reads in the same file, so long as their sample names are different! Otherwise, typically, tools expect per sample BAMs.

  • sergey_ko13sergey_ko13 Member

    thank you @shlee

    here sample names differ for T and N so we are safe.
    My question was rather about to put more reads from same lanes in one bam (use both T and N) for better (?) BQSR and later use that common TN BAM for MuTect2 - if that makes sense? And separate tumor alignments for sequencing artifacts and contamination detection only.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @sergey_ko13, the different lane-level reads should have different read group IDs. This is critical for BQSR. The different read groups constitute a sample-level BAM file. Are you asking if our downstream tools are read group aware? So far as I know they are not.

  • sergey_ko13sergey_ko13 Member

    Dear @shlee ,

    I am sorry for I have asked, probably, not clear enough.
    BQSR (said in the best practices) benefits from more reads. So we plan to run it on all our alignments (including tumor and normal) together at once. And we wonder if there is anything against of using that one resulting BAM for mutect2 input?
    Sample names differ, readgroups are flowcell- and lane-aware. Everything safe there.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @sergey_ko13, should be fine with Mutect2.

  • samuelriverosamuelrivero Washington DCMember


    Is there any Mutect2 option to include the variant type in the vcf file. I mean to include a tag (i.e "indel") in the vcf file. I have have been looking at the --annotation/A but I don't find the argument


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @samuelrivero,

    There is a module, VariantType, that should be able to do this. This was last available in GATK3 and I do not know if it was ported to GATK4. You can just use GATK3 if it is not available. Taking a glance at the current GATK4 annotations here, I do not see the module.

  • FPBarthelFPBarthel HoustonMember ✭✭

    I found a minor discrepancy in documentation. Here it's suggested to omit --germline-resource from Mutect2 when using it to call all variants on a normal sample for panel-of-normal creation, but in the CreateSomaticPanelOfNormals documention here this parameter is included in the example. I had initially included this but after re-reading this I am removing it from PON calling step.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @FPBarthel Good catch! This is my fault because that documentation is generated automatically from the tool javadoc. I will issue a PR to fix the codebase, and perhaps @shlee would know when the documentation gets re-generated.

  • Hi,

    I have some confusion regarding step 5 (OPTIONAL):
    In the part of CollectSequencingArtifactMetrics scripts, there is only "tumor.bam" used as input, but in the resulted "pre_adapter_summary_metrics" examples, there are tables for both HCC1143_tumor and HCC1143_normal. Did this imply that we should run CollectSequencingArtifactMetrics for each BAM file of tumor and normal, separately ?
    Additionally, dId the summary table of HCC1143_normal is used only for verification purpose ? Because I see that in the subsequent filtering step, i.e "FilterByOrientationBias", only "tumor_artifact.pre_adapter_detail_metrics.txt" was used to filter vcf file.

    Thank you very much in advance for your clarification,


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited June 2018

    @davidben + @FPBarthel,

    Tool documentation updates have now been automated for v0.0.x.0 releases of GATK. Many thanks to @Geraldine_VdAuwera for this. So the tool docs should show v4.0.5.0 docs, given it was released on June 7.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Actually they're semi-automated -- there's still a small human intervention step needed. Working on getting rid of that too ;)

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Phuong_Nguyen_1,

    In the part of CollectSequencingArtifactMetrics scripts, there is only "tumor.bam" used as input, but in the resulted "pre_adapter_summary_metrics" examples, there are tables for both HCC1143_tumor and HCC1143_normal.

    I show both for illustration. The normal is a control against which you can gauge the tumor's extent of sequencing artifacts. Sorry for the confusion.

  • Hi @shlee ,

    Thank you very much for your clarification.

    Best regards,

  • Hi,
    One of my colleagues went to the GATK4 course last week in Cambridge, UK and shared the variant calling GATKwr26-S1-Somatic_SNVs_and_indels presentations with me. In this presentation https://drive.google.com/drive/folders/1aJPswWdmMKLSmXB6jjHMltXj6XDjSHLJ on the slide with title Contamination estimation commands and main options pointing how to run gatk GetPileupSummaries for the germline AF resource you add -V af-only-gnomad.vcf.gz. To my understanding this is the same file one uses for running Mutect2 --germline_resource af-only-gnomad.vcf.gz. However, when I look into the GATK resource bundle there is a folder Mutect2/GetPileupSummaries and there the files are small_exac_common_3.hg38.vcf.gz. Also above in your post from March 16 you say to use Mutect2/GetPileupSummaries/small_exac_common_3. Could you please advice me which file to use for GetPileupSummaries - is the presentation the most updated version?
    In the same presentation on the slide with title Mutect2 command and main options it says when running Mutect2 to use --germline_resource af-only-gnomad.vcf.gz --af_of_alleles_not_in_resource 0.0000025. However on the Mutect2 documentation you advice to use --germline-resource af-only-gnomad.vcf.gz --af-of-alleles-not-in-resource 0.00003125 - which is a different allele fraction. Again - which value do you recommend to use for af-of-alleles-not-in-resource for the same file? Is the presentation the most up-to-date version?
    Thank you

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @BilyanaStoilova for GetPileupSummaries you should use one of the small_exac_common resources -- we have them for hg19 and hg38. Using af-only-gnomad would work, but it would be very slow. As for --af_of_alleles_not_in_resource, we tweaked that part of the code a lot in the last few months but we now recommend not specifying it at all because Mutect2 automatically chooses an appropriate value.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @BilyanaStoilova,

    Yes, in older iterations of the hands-on exercise or presentations before the Cambridge UK workshop that just concluded last week, this is an error. I do see the presentation showing the same mistake. Apologies:

    @Tiffany_at_Broad, please take note for updates.

    In the Cambridge UK M2 hands-on worksheet at https://drive.google.com/open?id=1mbOUtTuOUVDTEmDeIzkkJPZxpeLL7BnL we use the correct recommended resource:

    If any older iterations of the materials say otherwise, then as @davidben says, the tool will be slow.

  • @davidben & @shlee
    Thank you very much for the clarifications.

  • FPBarthelFPBarthel HoustonMember ✭✭

    Hi @shlee,

    I have a question about PON building.

    When you say:

    Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.

    Does that mean it is better to have several smaller PONs per batch? Eg. let's say you have paired tumor-normal samples from the following four batches (all whole genome sequencing):

    N. of normals Read Length Instrument Coverage
    Batch A 5 75 HiSeq 40x
    Batch B 10 100 HiSeq 5x
    Batch C 20 100 HiSeq 40x
    Batch D 40 150 NovaSeq 40x

    Would it be better to have seperate PONs for batches A-D, even when several of these batches are very small in number? Or, would you ideally create a PON with the normals from all these batches added together?

  • Hi @shlee
    I have a question about the flag --max-reads-per-alignment-start. What does it mean? I did not understand the comment and why it is set to 50 by default.
    I have highly deep sequenced data. Is this flag what I need to disable to avoid downsampling? I read somewhere that mutect2 by default downsamples to 1000.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @FPBarthel That's a tough call and I don't think anyone can actually give a confident answer. If you forced me to guess I would probably say to pool all the HiSeq together but keep it separate from NovaSeq.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @anabbi --max-reads-per-alignment-start is a downsampling parameter. If it is set to 10, for example, then Mutect2 downsamples to 10 reads that start at any given position. If the read length is L, then the average downsampled coverage will be about 10 * L. It is set to 50 by default because in most applications coverage above that amount is almost always due to unmappable regions that suck up an enormous amount of compute time only to produce a pile of false positives.

  • FPBarthelFPBarthel HoustonMember ✭✭
    edited August 2018

    @davidben said:
    @FPBarthel That's a tough call and I don't think anyone can actually give a confident answer. If you forced me to guess I would probably say to pool all the HiSeq together but keep it separate from NovaSeq.

    @davidben I had some more time to think about this. From my understanding of how PON works in Mutect2, from reading and working with the @slee tutorial and also the documentation here, it occurred to me that the PON creation step simply takes variants present in two or more VCF files (of normal samples) and nothing else, unless I am missing something? So, I wonder if it would really be such a bad thing to select multiple of both NovaSeq and HiSeq normal VCF files to be merged together in one PON. It seems unlikely that platform-specific artifacts will be present in both sets of VCFs, and if they are present in 2 or more within a set that's fine because they would still be included? Moreover, it doesn't seem like adding additional VCFs from less optimal sources would reduce the effectiveness of the PON filtration step (eg lets assume you had an additional set from NextSeq, or even a set of samples from FFPE)? Or would it? Thus, I could create seperate PONs for NovaSeq and HiSeq and while it would make M2 mutation calling a bit faster and more efficient (due to having to load/work with a smaller PON), it would not make it "better" by any means. Is this correct?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @FPBarthel The issue of an over-inclusive PoN is that it over-filters, reducing sensitivity.

  • I have a question about the population germline variant resource used in Mutect2 and in making the Panel of Normals.

    My main question is this: Can I use a bootstrapped-knownSites.vcf that I generated with HaplotypeCaller in the BQSR process as the --germline-resource variant file for Mutect2 and CreateSomaticPanelOfNormals?

    For clarification: I have 7 tumor-normal sequences of the same mouse strain. I do not have an official population germline variant file and I have only pre-processed one tumor-normal sample so far—can I make a master/combined population --germline-resource by combining all of the bootstrapped knownSites.vcf generated in the BQSR phase of data preprocessing for every sample? Can I run all of my normal.bams through HaplotypeCaller at the same time to generate a master knownSites.vcf in one go?

    More generally: How essential are the -pon and --germline-resource if I'm already using a matched normal?

    The reason I don’t think I can use any “general” germline variant resources is because we are looking to see if our substrain has accumulated normal somatic variants throughout life as well due to the nature of an introduced germline mutation, in addition to any true somatic tumor mutations. Our normals likely have a different germline variant profile than any official germline variant resource of the parent strain.

    Much appreciated and keep up the great work GATK Team!

  • I have a question about GetPileupSummaries. I've got the table with all zeroes in the 3rd, 4th, 5th columns which results in 0.0 contamination.

    At the end of the pipeline, about 10,000 mutations pass the filter. Is it correct?


    I have switched the bam input from tumor.bam to "2_tumor_normal_m2.bam" (from Mutect2 -bamout). It works and about 300 mutations pass the filter.

    I used tumor vs. normal and pon (made of 3 normals).

    gatk GetPileupSummaries -R hg38.fasta -I sample01.bam -V hg38.vcf.gz -O sample01.table

    Note that sample01.bam is the output from Mutect2 with -bamout.

    Which bam is intended to use with GetPileupSummary? If it is the original bam, what's different between the original bam and the bam from Mutect2 with -bamout?

  • Jong_Hwan_KimJong_Hwan_Kim republic of koreaMember

    Hi, @shlee
    I'm really appreciated about this tutorial.

    and i have a question, about section 3.
    gatk GetPileupSummaries \
    -I tumor.bam \
    -V resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -O 7_tumor_getpileupsummaries.table

    I want to use this code, but error messages like this :

    A USER ERROR has occurred: Argument intervals was missing: Argument 'intervals' is required.

    I want to check total region, but the "GetPileupSummaries" looks like need to take intervals.

    --input,-I:String BAM/SAM/CRAM file containing reads This argument must be specified at least once. Required.
    --intervals,-L:String One or more genomic intervals over which to operate This argument must be specified at least once. Required.

    Do i need to make interval file? or Is there anything part I missed?


  • hi @Chatchawit ,

    which vcf did you use for -V ?

    You indeed need Tumor.bam for GetPileupSummaries.
    It calls the common variant sites from vcf (I used small_exac_common_3.hg38.vcf.gz) in your Tumor.bam and assumes the portion of WT reads in Homozygous Variant sites as contaminant. Bamout covers too less sites I guess.

    maybe that thread could be interesting:


  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @Jong_Hwan_Kim In most cases GetPileupSummaries should take use the same resource file for -V as for -L.

    @shlee This change was from a few weeks ago and it's my fault for failing to notify the comms team. Basically, we needed to make GetPileupSummaries a LocusWalker in order to optimize it for the cloud, and this forces the command line to be a bit redundant. The javadoc for the tool is up to date so you should be able to use that to update the tutorial.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @Chatchawit In addition to checking about -V (which should be small_exac_common_3.hg38.vcf.gz in most cases) what intervals did you sequences on, WES, WGS, or a small panel?

    You should use the bam, not the bamout, because bamout only retains reads from regions where Mutect2 performs local assembly.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for the update @davidben. The gatk release notes omit mention of GetPileupSummaries. Apologies from my end for not catching that! I do see your PR with the relevant changes from August 2. So this change applies to GATKv4.0.8.0+. I will add a small note to the tutorial for now.

  • xiuczxiucz Member

    Since GATK4's Mutect2 is quite different from the old version MuTect, is there a logic or mathematic document to interpret Mutect2 result like this ,[4464-how-mutect-filters-candidate-mutations]( I cannot post the link).
    If not, I guess it is the time to undertake it now.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭
  • namratapatelnamratapatel AnandMember

    I have samples of 4 patients and we have taken two normal samples, how can i call variants using Mutect2 if i dont have matching normal and patient pairs and i want to use both my normal samples.

    Thank You

  • Jong_Hwan_KimJong_Hwan_Kim republic of koreaMember

    @shlee @daviden
    Dear GATK develop team,

    I have a problem with the GetPileupSummaries step.

    It was executed by the following command.

    gatk GetPileupSummaries
    -R genome.fa \
    -I Tumor.bam \
    -V GATK4/small_exac_common_3_b37_chr.vcf.gz \
    -L GATK4/small_exac_common_3_b37_chr.vcf.gz \
    -O Tumor_getpileupsummaries.table

    My reference genome starts with 'chr', so I changed the small_exac_common_3_b37_chr.vcf.gz file to start with chr. (including the part '## contig = <ID =')

    I showed the following error message.

    "java.lang.NumberFormatException: For input string:" POS ""

    Is there anything I missed?

    Thank you for answer.

    Jong Hwan, Kim.

  • Jong_Hwan_KimJong_Hwan_Kim republic of koreaMember

    I think i find solution about my answer,

    i feel sorry about GATK develop team,

    but maybe someone can suffer same problem,

    so, i want to reply my question.

    the problem is following massages,

    gatk GetPileupSummaries
    -R genome.fa \
    -I Tumor.bam \
    -V GATK4/small_exac_common_3_b37_chr.vcf.gz \
    -L GATK4/small_exac_common_3_b37_chr.vcf.gz \
    -O Tumor_getpileupsummaries.table

    "java.lang.NumberFormatException: For input string:" POS ""

    My reference genome starts with 'chr', so I changed the small_exac_common_3_b37_chr.vcf.gz file to start with chr.

    In this process, unzip and re-zip using bgzip (not gzip),

    and use ./gatk- IndexFeatureFile -F xx.vcf.gz (do not tabix -f vcf xx.vcf).

    or, actually, i used 'small_exac_common_3_b37_chr.vcf' with out bgzip and just use ./gatk- IndexFeatureFile -F xx.vcf

    maybe anyone has not this problem, but i hope that this small solution can help someone.


    Jong hwan, kim.

  • kchang3kchang3 Member

    I'm running getPileupSummaries step on a TCGA BAM mapped with GRCh37

    Here's my command
    gatk- GetPileupSummaries \
    -I TCGA-HD-8635-11A-01D-2394-08.bam \
    -V small_exac_common_3_b37.vcf.gz \
    -L small_exac_common_3_b37.vcf.gz \
    -O TCGA-HD-8635-11A-01D-2394-08.pileup.table

    and receive the following error. I saw some old post about old bam index causing such error. so i regenerate the index with samtools 1.8. but I still get the same error. Any suggestions on what the problem could be?

    12:09:55.633 INFO GetPileupSummaries - Initializing engine
    12:09:55.994 INFO FeatureManager - Using codec VCFCodec to read file file://small_exac_common_3_b37.vcf.gz
    12:09:56.037 INFO FeatureManager - Using codec VCFCodec to read file file://small_exac_common_3_b37.vcf.gz
    12:09:56.326 INFO IntervalArgumentCollection - Processing 60040 bp from intervals
    12:09:56.344 INFO GetPileupSummaries - Done initializing engine
    12:09:56.344 INFO ProgressMeter - Starting traversal
    12:09:56.344 INFO ProgressMeter - Current Locus Elapsed Minutes Loci Processed Loci/Minute
    12:09:58.248 INFO GetPileupSummaries - Shutting down engine
    [January 8, 2019 12:09:58 PM CST] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 0.05 minutes.
    htsjdk.samtools.SAMFormatException: Invalid GZIP header
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:121)
    at htsjdk.samtools.util.BlockGunzipper.unzipBlock(BlockGunzipper.java:96)
    at htsjdk.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:550)
    at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:532)
    at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
    at htsjdk.samtools.util.BlockCompressedInputStream.seek(BlockCompressedInputStream.java:380)
    at htsjdk.tribble.readers.TabixReader$IteratorImpl.next(TabixReader.java:427)
    at htsjdk.tribble.readers.TabixIteratorLineReader.readLine(TabixIteratorLineReader.java:46)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.readNextRecord(TabixFeatureReader.java:170)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.(TabixFeatureReader.java:159)
    at htsjdk.tribble.TabixFeatureReader.query(TabixFeatureReader.java:133)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.refillQueryCache(FeatureDataSource.java:595)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.queryAndPrefetch(FeatureDataSource.java:564)
    at org.broadinstitute.hellbender.engine.FeatureManager.getFeatures(FeatureManager.java:308)
    at org.broadinstitute.hellbender.engine.FeatureContext.getValues(FeatureContext.java:163)
    at org.broadinstitute.hellbender.engine.FeatureContext.getValues(FeatureContext.java:115)
    at org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries.apply(GetPileupSummaries.java:195)
    at org.broadinstitute.hellbender.engine.LocusWalker.lambda$traverse$0(LocusWalker.java:176)
    at org.broadinstitute.hellbender.engine.LocusWalker$$Lambda$86/1541427914.accept(Unknown Source)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at org.broadinstitute.hellbender.engine.LocusWalker.traverse(LocusWalker.java:174)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:979)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

  • sutturkasutturka Member

    Following the comment Even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts.

    I would like to know if WGS and Exome PON are intercompatible? I have a PON created with WGS data. However, my samples are exome with matched normal.

    Using WGS PON with exome sample is correct OR I should rely only on the matched normal? If yes, I suppose I can use the --af-of-alleles-not-in-resource value calculated as 1/(2*WGS samples).

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @sutturka This is trickier because WGS and WES don't necessarily have the same mapping artifacts, even in the exome. Nonetheless, I think a WGS panel is better than none at all.

  • sutturkasutturka Member

    I think I am seeing high level of contamination in my samples. In two WES samples, I get - 0.426 and 0.568 as contamination values and about 4,000/30,000 calls are filtered solely for contamination. This seems to be a very high. What is the generally acceptable value with this? I am using GATK version Do you have any suggestions regarding this?

    Commands with Exome data:

    gatk GetPileupSummaries   \
    --input  tumor.bam  \
    --variant  SNP.INDEL.BIALLELIC.vcf.gz  \
    --output  tumor_pileup_summary.table  
    gatk GetPileupSummaries   \
    --input  normal.bam  \
    --variant  SNP.INDEL.BIALLELIC.vcf.gz  \
    --output  normal_pileup_summary.table  
    gatk CalculateContamination   \
    --input  tumor_pileup_summary.table  \
    --matched-normal  normal_pileup_summary.table  \
    --output  contamination.table  
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @sutturka I would definitely try using the most recent release for this.

  • sutturkasutturka Member

    Thanks @davidben. I was using the older version for consistency with other samples ran previously. I tested this with the latest version on our system GATK/ and the contamination values are:

    sample GATK/ GATK/
    S1 0.56857513 0.567995399
    S2 0.426566801 0.442471985

    This seems to be a very high. What is the generally acceptable value for contamination? I am trying to get an idea to what extent I should trust the results and any other adjustments I should make to be stringent/lenient for contamination detection?

    Also, the PASS mutation calls are quite different for sample S2. Is this expected?

    sample GATK/ GATK/
    S1 204 206
    S2 843 751

    Thank you for the help.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @sutturka I agree, those are very high estimates. The recent improvements (both enhancements in the model and bug fixes) to CalculateContamination were in the release, so I'm not surprised that didn't change much. Are you able to run with a jar?

    Some more questions, which you don't need to answer if fixes everything, are:

    • what reference are your bams aligned to?
    • how much CNV is there in your samples?
    • what is the depth like?
    • what is CalculateContamination's --tumor-segmentation output?
    • do the results change when you give CalculateContamination the argument --low-coverage-ratio-threshold 0.8? (Note: I am not saying that this is in any way the best practice. It's just a way to diagnose.)

    Given the high contamination estimate I'm not surprised at the difference between S1 and S2, but I am just as skeptical as you about that estimate.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited January 2019

    Hi @sutturka and @davidben,

    With estimated contamination levels at around 43-57%, you might want to consider whether data from a different individual may have accidentally ended up in your sample, e.g. at the read group level or in a manner that may lead you to toss the sample.

    For read group level mixups, you can use CrosscheckFingerprints. Also, you can corroborate contamination estimates with Picard IdentifyContaminant. The methods for these tools are described in a yet-to-be-merged whitepaper which you can find at https://github.com/broadinstitute/picard/pull/1247.

    If the mix-up is at the read group level, you can simply remove the contaminating read group data. If not, then I have a chart with data interpretation guidelines that used to be a part of the GATK workshop presentations long long ago. In this chart, for 15-50% contamination, which is considered heavy, the Broad Genomics Platform recommends removing the sample and following up with the project manager. For >50% contamination, consider whether there has been a sample swap. Again follow up with your project manager.

  • ddaneelsddaneels Member
    edited February 2019


    I'm not sure that I understand the use of the --germline-resource and --af-of-alleles-not-in-resource parameters

    My input bam comes from a human tumor-only gene panel (20 genes).

    So I'm using the following parameters:
    --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz
    --af-of-alleles-not-in-resource 0.0000025

    Am I correct that this will discard all variants that are present in the gnomAD.vcf file with an AF > 0.0000025 from my data? Or are those variant present in my resulting .vcf file, but with some kind of annotation?

    I don't understand why the --af-of-alleles-not-in-resource is adapted based on the amount of data in the resource?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @ddaneels --af-of-alleles-not-in-resource is the imputed allele fraction of variants not found in gnomAD used by the germline and contamination probabilistic models. It does not imply any hard filtering threshold. It depends on the amount of data in the resource because the bigger the resource the rarer you can assume an allele missing from it is. Also, we now recommend leaving this parameter unspecified (except in cases where people are not using gnomAD) as we improved the defaults several releases ago.

  • pwaltmanpwaltman New York, NYMember

    Has this been updated for the recently released gatk- I see that there's a new filter FilterAlignmentArtifacts - how would that work into this current pipeline?

  • ddaneelsddaneels Member

    @davidben so you mean that I don't need to specify the parameter --af-of-alleles-not-in-resource 0.0000025 when I'm using --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz. In that way the default value of --af-of-alleles-not-in-resource is used?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @ddaneels That's right. Leave it out of the command in order to use the default.

  • Crispy13Crispy13 Member

    Should normal samples be pre-processed like tumor samples with the method described in best practice?

  • rshahirshahi BrusselMember

    Hi @shlee,
    Some questions about CreateSomaticPanelOfNormals(GATK/

    1. Is the header in the final PON vcf is just copied from one of the vcfs included for the creation of PON?

    2. Is it possible to know in how many samples(normals) a variant is detected (if a variant is present in more than 2 samples)?


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Crispy13. You should preprocess your normals just as you do your tumor. In fact, I should think you want to joint process your normal and tumor samples if possible.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited March 2019

    Hi Raj (@rshahi),

    Is the header in the final PON vcf is just copied from one of the vcfs included for the creation of PON?

    Yes. It appears v4.1.0.0 CreateSomaticPanelOfNormals copies over each type of metadata line from the input VCFs and for those lines present in multiple samples, copies over that from the first provided VCF. You can easily check this with e.g.

    diff <(gzcat mutect2_precomputed/6_threesamplepon.vcf.gz | grep '##') <(gzcat mutect2_precomputed/3_HG00701.vcf.gz | grep '##') 

    What this means is one-sample-specific information will be present in the PoN header, e.g. in the copied over ##GATKCommandLine and ##tumor_sample metadata lines for the given tutorial data. You can test for yourself using GATK Workshop data bundles, which are all publically available. I provide three VCFs towards M2 PoN creation.

    Is it possible to know in how many samples(normals) a variant is detected (if a variant is present in more than 2 samples)?

    Not for v4.1.0.0. The only thing you know for certain is that the PoN will contain variants present in at least two samples given the default value of --min-sample-count, which is two. You can of course change this value to your liking.

    All this being said, you should know that there are changes afoot for the next release (v4.1.1.0) on how CreateSomaticPanelOfNormals handles header information from samples. Please checkout https://github.com/broadinstitute/gatk/issues/5609. If you are unfamiliar with GitHub, please see Article#23405 for a quickstart. As you can see, the tool will soon output all sample names as well as the number of samples the variant was found in.

    One last note. GATK3 CombineVariants (unavailable in GATK4), will produce a count of the number of samples with the variant in the INFO field's AC annotation. Previous to the existance of CreateSomaticPanelOfNormals, (for GATK3 and I believe alpha and beta versions of GATK4) GATK recommended generating the PoN with CombineVariants (+MakeSitesOnlyVcf).

    Hope this is helpful.

  • rshahirshahi BrusselMember

    Hi @shlee, I appreciate your clarification. KR

  • rshahirshahi BrusselMember

    Hi @shlee,

    Question regarding CalculateContamination(GATK/

    With CalculateContamination in tumor(only) mode, I get:
    contamination error
    0.32779720279720276 0.9446741568277018

    But in matched normal mode I get an error like this:
    contamination error
    NaN 1.0

    Then for curiosity, if I run in normal only, I get;
    NaN 1.0

    When I look at the tumor.table and normal. table files generated by getpileupsummaries, I don't see any unusual data structure/value

    What can be the problem?

    For your reference, the tumor/normal. table files are attached herewith.


  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @rshahi There are very few sites in these data. Is it from a gene panel? We don't recommend calculating contamination in this case because there aren't enough germline hom alt sites (in this case I don't see any) to make a reliable estimate.

  • rshahirshahi BrusselMember

    Dear @davidben, thanx a lot for triggering me go back to my code. It turned out that value for -L argument in GetPileupSummaries was mistakenly given foo.vcf generated from Mutect2 instead of interval file (exome), that's why there were few sites in that data.

Sign In or Register to comment.