Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

HEADS UP : The planned released has been pushed back to Monday 4/24. See the Blog for release notes.

Broad Mutation Calling Best Practice Workflows

edited September 2016

The Broad Mutation Calling Best Practice Workflow

Overview

The Broad Mutation Calling Best Practice Workflow is a somatic mutation and variant analysis pipeline. It is broken down into four component workflows in FireCloud, QC, Copy Number, MuTect and Filtering. Three of these will have both Open Access (OA) and Controlled Access (CA) “Best Practice” workspaces (Copy Number, MuTect and Filtering) because they each contain a “Panel of Normals” (PoN).

The Open Access workspaces of the Broad Mutation Calling Best Practice Workflow are available to all FireCloud users, and contain paired cell line BAMs. Controlled Access workspaces will contain PoNs with protected data that requires prior dbGaP authorization to access. Go here for more information about obtaining access to these workspaces.

Best Practice Workspaces: Open Access (OA)

Best Practice Workspaces: Controlled Access (CA)

If you have access to a FireCloud Billing Project, you can clone any of these workspaces and launch an analysis through the Method Configurations tab.

Note: all methods that contain ‘_BETA’ in the name, are still undergoing active development.

Runtime

The total expected runtime for this workflow depends on the size of the pair or pair set you select for analysis. Pair HCC1954_100_gene_pair runs on "tiny" 100-gene BAMs and the expected runtime is roughly 15 minutes. Pair HCC1143_WE_pair runs on Whole Exome BAMs and the expected runtime is roughly 2.5 hours.

The QC task counts reads overlapping regions for tumor and normal BAM files. The task concludes with a report of the counts over the BAMs and lanes. Correlation values are included for comparison purposes.

ContEst uses a Bayesian approach to calculate the posterior probability of the contamination level and determine the maximum a posteriori probability (MAP) estimate of the contamination level.

ContEst supports array-free mode, where we genotype on the fly from matched normals, and use that as our source of homozygous variant calls. It currently calls anything with > 80% of bases as the alternate with at least 50X coverage a homozygous alternate site.

Picard Metrics Tasks invoke multiple metrics reporting routines from the Picard toolkit. The following routines are invoked:

• CollectAlignmentSummaryMetrics

• CollectInsertSizeMetrics

• QualityScoreDistribution

• CollectInsertSizeMetrics

• QualityScoreDistribution

• MeanQualityByCycle

• CollectBaseDistributionByCycle

• CollectSequencingArtifactMetrics

• CollectQualityYieldMetrics

• CollectOxoGMetrics.

SAM/BAM validation routines (ValidateSamFile) in both summary and verbose mode are called. Other routines such as CollectHsMetrics, and MarkDuplicates are also called.

Nozzle Report

Analysis results from QC, ContEst, and Picard Metrics are populated to HTML reports that you can view through the workspace Data tab. This report structures the results into tables.

Inputs and Outputs

Below are the tool-specific inputs and outputs for this workflow. Note that these inputs and outputs may vary from the “standard” version of the tool.

Inputs

• File picardHapMap

• File DB_SNP_VCF

• File HaplotypeDBForCrossCheck

• File tumorBamIdx

• String tumorBamGB

• File picardTargetIntervals

• String pairName

• File normalBam

• String pairBamGB

• String normalBamGB

• File captureNormalsDBRCLZip

• File refFasta

• File tumorBam

• File SNP6Bed

• String validationStringencyLevel

• File exomeIntervals

• File refFastaIdx

• File normalBamIdx

• File picardBaitIntervals

• File regionFile

• File DB_SNP_VCF_IDX

• Integer preemptibleVal

• File HapMapVCF

Outputs

• String fraction_contamination

• File crossCheckStats

• File crossCheckReport

• File metricsReportsZip

• File tumor_validation

• File hsMetrics

• File metricsReportsZip

• File normal_validation

• File normalhsMetrics

• File QC_nozzle_report

• File copy_num_report_html

How to Run this Module in FireCloud

• Clone the Broad_MutationCalling_QC_Workflow_BestPractice workspace to run this workflow.

• In your cloned workspace, navigate to the Method Configurations tab and click on the method config, MutationCalling_QC_v1-1_BETA_cfg.

• Click Launch Analysis.

• In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143_WE_pair. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.

• Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.

• You can view results and outputs by clicking on the most recent analysis run, e.g., HCC1143_WE_pair.

• Click on Outputs: Show, then select output files to view the results of this analysis.

The Broad Mutation Calling MuTect workflow runs on a tumor and normal pair or pair set, and consists of three main steps: 1) preparation, 2) somatic mutation analysis, and 3) somatic mutation annotation. As a part of this workflow, MuTect is also run in force-call mode to search for somatic variants at a set of specified loci for clinical relevance (for example, TP53 gene on chromosome 17).

The three main steps are described here:
1. Preparation: The first step prepares input parameters, including the Intervals file, for analysis by MuTect1 and MuTect2. This step may be known as “scatter preparation.”

2. Somatic Mutation Analysis: 2 somatic variant analyses are carried out in parallel: MuTect1 and MuTect2. MuTect1 is used for its somatic single-nucleotide variant detection capabilities. MuTect2 is used for its indel detection capabilities. MuTect1 and Mutect2 are run in parallel over the Intervals file. This step may be known as “scatter”.

3. Somatic Mutation Gather: This final step serves to annotate variants found in the previous step. First the results are gathered and merged. Then, from the MuTect2 output, the indel-variants are separated and merged with the single-nucleotide variants from MuTect1. Next, only variants tagged with PASS are fed to Oncotator for annotation with Oncotator. Finally, all variants are passed to the VEP (Variant Effect Predictor) program for further annotation.

Runtime

The total expected runtime for the Broad Mutation Calling MuTect workflow depends on the size of the pair you select for analysis. Pair HCC1954_100_gene_pair runs on "tiny" 100 gene BAMs and the expected runtime is roughly 45 minutes. Pair HCC1143_WE_pair runs on Whole Exome BAMs and the expected runtime is roughly 1.5 hours.

MuTect1

MuTect1 is the original DREAM challenge-winning somatic point mutation caller (Cibulskis et al., 2013. It identifies somatic point mutations in next generation sequencing data of cancer genomes. It inputs sequencing data for matched normal and tumor tissue samples, and outputs mutation calls and optional coverage results.

In a nutshell, the analysis itself consists of three steps:

• Pre-process the aligned reads in the tumor and normal sequencing data

• Identify using statistical analysis and a Panel of Normals, sites that are likely to carry somatic mutations with high confidence. (See the Panel of Normals section for more information.)

• Post-processing of candidate somatic mutations

For complete details, please see the 2013 publication in Nature Biotechnology.

MuTect2

MuTect2 is a somatic SNP and indel caller that combines the original MuTect (Cibulskis et al., 2013) with the assembly-based machinery of HaplotypeCaller.

The basic operation of MuTect2 proceeds similarly to that of the HaplotypeCaller.
While the HaplotypeCaller relies on a ploidy assumption (diploid by default) to inform its genotype likelihood and variant quality calculations, MuTect2 allows for a varying allelic fraction for each variant, as is often seen in tumors with purity less than 100%, multiple subclones, and/or copy number variation (either local or aneuploidy). MuTect2 also applies some hard filters to variants before producing output.

Oncotator

Oncotator is a tool for annotating information onto genomic point mutations (SNPs/SNVs) and indels. It is primarily used for human genome variant callsets. However, the tool can also be used to annotate any kind of information onto variant callsets from any organism.

By default, Oncotator uses a simple TSV file (e.g., MAFLITE) as an input and produces a TCGA-format MAF as an output. Oncotator also supports VCF files as an input and output format. By extension, Oncotator can be configured to annotate genomic point mutation data with HTML reports as it does in this workflow.

VEP (Variant Effect Predictor)

Ensembl’s VEP (Variant Effect Predictor) program processes variants for further annotation. This tool annotates variants, determines the effect on relevant transcripts and proteins, and predicts the functional consequences of variants.

VEP accepts as input coordinates any identified alleles or variant identifiers. A full list of input files is available here. If a variant that you enter as input causes a change in the protein sequence, the VEP will calculate the possible amino acids at that position and the variant would be given a consequence type of missense.

Nozzle Report

The final step of the workflow combines the analysis results from MuTect1, MuTect2, Oncotator, and VEP into a single HTML report. Our method configuration instructs FireCloud to write a link to the report back to the workspace data model, allowing you to view the report from the workspace Data tab. This report structures filtered results and annotations into tables.

Inputs and Outputs

Inputs

• File tumorBam

• File tumorBamIdx

• File normalBam

• File normalBamIdx

• File refFastaIdx

• File mutectIntervals

• File mutectForcecallIntervals

• File refFasta

• String fracContam

• File dbSNPVCF

• File cosmicVCF

• String downsampleToCoverage

• File normalPanel

• File oncoDBTarBall

• File VEP_File

• Int nWay

• String mutectGBDiskVal

• String oncoVEPDiskVal

• String pairName

Outputs

• File mutectfc_pw

• File mutectfc_cw

• File preannotated_vcf

• File oncotator_log

• File mutect1Merged

• File mutect2Merged

• File VEP_VCF

• File VEP_Output

• File VEP_Report

• File nozzleReport

• File nozzleRData

How to Run this Module in FireCloud

• Clone the Broad_MutationCalling_MuTect_Workflow_BestPractice workspace to run this workflow.

• In your cloned workspace, navigate to the Method Configurations tab and click on the method config, MutationCalling_Mutect_v1-2_BETA_cfg.

• Click Launch Analysis.

• In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143_WE_pair. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.

• Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.

• When the status displays Done, select the most recent analysis run, e.g., HCC1143_WE_pair.

• Click on Outputs: Show, then select output files to view the results of this analysis.

Broad Mutation Calling Copy Number Workflow

The Copy Number workflow uses two types of information from sequencing data to detect copy number variations (CNVs). First, targets (usually exons, but any genomic locus) with abnormally high or low coverage suggest amplifications or deletions, respectively. Second, sites that are heterozygous in a normal sample and have allele ratios significantly different from 1:1 in the matched tumor sample imply a CNV event involving one or both alleles.

The workflow is correspondingly split into two major portions:

1. GATK CNV: Using coverage data that has been normalized against a Panel of Normals (PoN) to remove sequencing noise, targets are partitioned into segments that represent the same copy-number event. (See the Panel of Normals section for more information.) In GATK CVN, segmentation is then performed by a circular-binary-segmentation (CBS) algorithm, developed to segment noisy array copy-number data. Amplifications, deletions, and copy-neutral regions are then called from the segmentation.

2. GATK ACNV: Heterozygous sites are identified in the normal case sample and segmented, again using CBS, according to their ref:alt allele ratios in the tumor sample. These allele-fraction segments are combined with the copy-ratio segments found by GATK CNV to form a common set of segments. Modeling of both the copy ratio and minor allele fraction of each segment is alternated with the merging adjacent segments that are sufficiently similar according to this model, until convergence.

Steps

1. Coverage Collection

2. Process Panel of Normals

3. Segmentation by Tangent-normalized Coverage

4. Calling of Events from Coverage Segments

5. Collection of Allele Counts at HET Sites

6. Segmentation by Minor Allele Fraction

7. Union of Copy-ratio and Minor-allele-fraction Segments

8. Small-segment Merging

9. Model Fitting

10. Similar-segment Merging

Runtime

The total expected runtime for the Broad Mutation Calling Copy Number workflow depends on the size of the pair you select for analysis. Pair HCC1954_100_gene_pair runs on "tiny" 100 gene BAMs and the expected runtime is roughly 45 minutes. Pair HCC1143_WE_pair runs on Whole Exome BAMs and the expected runtime is roughly 1.5 hours.

Inputs and Outputs

Inputs

• File recapseg_bed

• File ref_fasta

• File ref_fasta_dict

• File ref_fasta_fai

• File common_snp_list

• File tumor_bam

• File tumor_bam_idx

• File normal_bam

• File normal_bam_idx

• File PoN

• String Disk_GB

• String entity_id

• Int PreemValue

Outputs

• File gatk_cnv_coverage_file

• File gatk_cnv_seg_file

• File gatk_cnv_tn_coverage

• File gatk_cnv_pre_tn_coverage

• File gatk_acnv_seg_file

How to Run this Module in FireCloud

• Clone the Broad_MutationCalling_CN_Workflow_BestPractice workspace to run this workflow.

• In your cloned workspace, navigate to the Method Configurations tab and click on the method
config, MutationCalling_CN_v1-0_BETA_cfg.

• Click Launch Analysis.

• In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143_WE_pair. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.

• Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.

• You can view results and outputs by clicking on the most recent analysis run, e.g., HCC1143_WE_pair.

• Click on Outputs: Show, then select output files to view the results of this analysis.

The Broad Mutation Calling Filtering workflow filters out somatic mutations that are more consistent with artifacts or germline mutations. This Filtering workflow takes as an input a VCF file that is generated by the Broad Mutation Calling MuTect workflow. Note that we provide this file as a workspace attribute onco_no_ann_out for the pre-loaded pair HCC1143_WE_pair.

Before filtering tasks are run, the VCF file is converted to a MAF file with Oncotator and initial PreAdapterDetailMetrics metrics are collected from the BAM file using a customized Picard command. The VCF-to-MAF conversion is done with some minimal annotation to add genomic context information to the generated VCF. Then, the MAF PoN Filter task builds a likelihood model based on sequencing conditions in a Panel of Normals (PoN) to filter out somatic mutations. The FFPE Filter and OxoG Filter tasks screen out Formalin-Fixed, Paraffin-Embedded (FFPE) and oxoguanine (OxoG) artifacts respectively. Finally, a MAF Merge task combines all filtered results into a single VCF file and the Oncotation task runs Oncotator to annotate the VCF file.

VCF to MAF Converter

The MAF PoN Filter task takes as a primary input a MAF file. In order to accommodate the VCF outputs from the MuTect workflow, this task converts a VCF file to MAF format, using the Oncotator tool.

Generate Metrics

The Generate Metrics task collects statistics on the BAM files that are needed for the subsequent filtering tasks.

MAF PoN Filter

The MAF PoN Filter uses a likelihood model to compare somatic mutations against a Panel of Normals (PoN) in order to screen out somatic mutations. The PoN represents sequencing conditions in the case sample, including germline variants and technical artifacts. Refer to the Panel of Normals section for more information.

FFPE Filter

FFPE introduces multiple types of DNA damage including deamination, which converts cytosine to uracil and leads to downstream mispairing in PCR: C>T / G>A. Because deamination occurs prior to ligation of palindromic Illumina adapters, likely deamination artifacts will have a read orientation bias. The FFPE Filter Task uses this read orientation to identify artifacts and calculate a Phred scaled Q-score for FFPE artifacts.

OxoG Filter

Oxidation of guanine to 8-oxoguanine is one of the most common pre-adapter artifacts associated with genomic library preparation, arising from a combination of heat, shearing, and metal contaminates in a sample). The 8-oxoguanine base can pair with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification. The OxoG Filter Task detects and screens out these artifacts.

MAF Merge

After the filtering steps are completed, the MAF Merge task takes the filtered variants from the three output MAF files (OxoG Filter, FFPE Filter and MAF PoN Filter), and adds them to the original VCF file.

Oncotation

The Oncotation task adds annotations to the VCF File using the Oncotator tool.

Runtime

The total expected runtime for the Broad Mutation Calling Filtering workflow depends on the size of the pair you select for analysis. Pair HCC1143_WE_pair runs on Whole Exome BAMs and the expected runtime is roughly 2.5 hours.

Inputs and Outputs

Inputs

• File DBSNP

• String oxoGOBF ALTALLELE

• Integer CODING ONLY

• File oncotation TarBall

• String ffpeOBF ALTALLELE

• String PairID

• File Generate Metrics

• File PoNFile

• Float WCUT

• Float thresh

• String oxoGOBF Context

• File DBSNPIDX

• File tumor BAM

• String NMIN

• File cytoBandFile

• String ffpeOBF Context

• String TOTNStr

• File REFERENCE

• File oncoDBTarBall

• String MIN_ALT_COUNT

• String ffpe OBF Stub

• File inVCF

• String TUM ID

• File tumor BAM IDX

• String oxoGOBF STUB

Outputs

• File Generate Metrics

• File MAF PoN Filter Zip

• File oxoG MatLab And Filter Data Zip

• File ffpe MatLab And Filter Data Zip

• File Merged Filtered VCF

• File Annotated Filtered VCF

How to Run this Module in FireCloud

• Clone the Broad_MutationCalling_Filter_Workflow_BestPractice_CA workspace to run this workflow. Note that this is a Controlled Access workspace because the PoN contains identifiable data.
• In your cloned workspace, navigate to the Method Configurations tab and click on the method, MutationCalling_Filter_v1-1_BETA.
• Click Launch Analysis.
• In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow. The preloaded pair is HCC1143_WE_pair.
• Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.
• You can view results and outputs by clicking on the most recent analysis run, e.g., HCC1143_WE_pair.
• Click on Outputs: Show, then select output files to view the results of this analysis.

Panel of Normals

The Panel of Normals (PoN) plays two important roles in the somatic mutation and copy number variant analysis and filtering workflows: (1) it excludes germline mutation or variant sites that are found in the normals to avoid calling them as potential mutations or variants in the tumor; and (2) it excludes technical artifacts that arise from particular techniques (e.g., sample preservation) and technologies (e.g., library capture, sequencing chemistry). An effective PoN consists of normals that are technically similar to and representative of the tumor samples, as well as sequenced on the same platform. Three of the four Broad Mutation Calling modules (Mutect, CN and Filtering) require use of PoNs.

As the TCGA PoNs contain identifiable data and are therefore considered Controlled Access, we will provide both Open Access and Controlled Access Best Practice Workspaces containing, respectively, Open and Controlled Access PoNs. PoNs are referenced in the workspace as Workspace Attributes.

Please note that the Open Access PoNs were created from a limited number of low-coverage 1000 Genomes samples. As such, these Open Access PoNs will not deliver adequate filtering results and will have a higher rate of false positives than well matched PoNs. The TCGA Controlled Access PoNs are a well-curated set of approximately 1000 TCGA normals.

We recommend, when importing your own data to FireCloud, that you build your own PoN using instructions provided within the GATK forum. When constructing a PoN, it is very important to use normals that are technically similar to and representative of your tumor samples, as well as sequenced on the same platform.

Open Access PoN

Open Access PoNs are publically available for MuTect and Copy Number.

• MuTect: refseq_exome_10bp_hg19_300_1kg_normal_panel.vcf

• Copy Number: open_access_pon_from_1000g.pon

These Open Access PoNs comprise a limited number of consented whole exome samples from the 1000 Genomes project: 300 samples for the MuTect PoN and 20 samples for the Copy Number PoN.

Controlled Access PoN

Users with dbGaP approval for controlled access TCGA data may access and use the Controlled Access PoNs, e.g., Filtering: final_summed_tokens.hist.bin for the Filter workflow. The Controlled Access PoNs comprise approximately 1000 carefully curated TCGA normals sequenced with Agilent, and span across multiple TCGA cohorts (excluding blood cancer). This will result in fewer false positives than the Open Access PoNs.

References

Post edited by jneff on
Tagged:

• Member Posts: 21

Hi, a question on the MuTect2 aspect of the workflow, you have "fracContam" as input, where are you making this (ContEST?) Can you give a little information on any filtering etc of output to make this file?

Thanks, Bruce.

edited June 2016

Hi Bruce,

If you were to clone the Broad_MutationCalling_QC_Workflow_V1_BestPractice workspace and run an analysis, the primary ContEST output would be the fracContam estimate, which I believe is 0.02 or 2% for the pre-loaded data within that workspace.

Within the separate Broad_MutationCalling_MuTect_Workflow_V1_BestPractice workspace, you'll find fracContam as an input string (number) that you can edit at runtime within the Method Configuration. The default input is 0.02, I believe because the two workspaces are referring to the same pre-loaded data.

Thanks,
Jason

• UCSFMember Posts: 2

Would you be able to give me a quick summary of how to interpret the outputs for this tutorial, as well as which output files are most relevant/informative when determining mutations? I am especially interested in understanding the MuTect1 and MuTect2 Outputs, Oncotator and VEP Outputs, and QC Outputs.

Thanks,
Heydi

Hi Heydi,

Please stay tuned as we plan to roll out new functionality and documentation for Mutation Calling workflows very soon. The new functionality will include Nozzle reports that will allow you to view the outputs without downloading files.

Thanks,
Jason

• ChinaMember Posts: 13

Hi @jneff,

I‘m excited about the somatic mutation calling best practice. Here is my question: How could this pipeline apply to non-cancer somtic mutation analysis? Because my project are to compare the mutations of the same tissue between patients and healths, so I can't get the paired study and control sample from the same individual that like cancer analysis. Could you give some advice for me? Thanks!

edited September 2016

Hi Tsunami,

The Broad Mutation Calling Best Practice workflows (QC, MuTect, and QC) run on pairs of tumor and normal BAM files for the same participant. Making the workflows runnable for your project may require more effort than simply changing the inputs and method configurations. There would likely be significant changes needed to the WDL and/or code for the runnable tools.

In that case, I would recommend checking out PLINK and see what is available through GATK. If any tools become available on FireCloud that may suit your project, we can let you know.

-Jason

• Member Posts: 35

Hello,
I have a question concerning Mutect workflow performance.
1.5h per WES pair sounds great but how much resources it uses (memory requirements, parallelization)?
I don't have access to FireCloud yet so I am currently testing on our local cluster to see if it would be feasible for big datasets.
And with -Xmx16g Mutect calling takes at least 10h per pair.
I would greatly appreciate any comments on how to improve the performance.
Thank you!

Best,

Eugenie

Hi Eugenie,

Are you running MuTect locally using Cromwell? Running MuTect in FireCloud would allow you to set runtime parameters (Method Config inputs) that can improve performance. For example, you can use the nWay parameter to adjust parallelism.

Our suggestion would be to test out different runtime parameters on a single pair in FireCloud. It's free to register for FireCloud, though you would need access to a Billing Project to clone a workspace and run analyses.

Thanks,
Jason

• Member Posts: 35

@jneff

Hi Jason,