Broad Mutation Calling Best Practice Workflows

jneff Boston Member, Broadie, Moderator admin
edited January 19 in Cancer Genome Analysis

The Broad Mutation Calling Best Practice Workflow

Overview

The Broad Mutation Calling Best Practice Workflow is a somatic mutation and variant analysis pipeline. It is broken down into four component workflows in FireCloud: QC, Copy Number, MuTect, and Filtering. Three of these (Copy Number, MuTect, and Filtering) have both Open Access (OA) and Controlled Access (CA) “Best Practice” workspaces, because they each contain a “Panel of Normals” (PoN).

The Open Access workspaces of the Broad Mutation Calling Best Practice Workflow are available to all FireCloud users, and contain paired cell line BAMs. Controlled Access workspaces will contain PoNs with protected data that requires prior dbGaP authorization to access. Go here for more information about obtaining access to these workspaces.

Best Practice Workspaces: Open Access (OA)

  • Broad_MutationCalling_QC_Workflow_BestPractice

  • Broad_MutationCalling_MuTect_Workflow_BestPractice_OA

  • Broad_MutationCalling_CopyNumber_Workflow_BestPractice_OA

  • Broad_MutationCalling_Filtering_Workflow_BestPractice_OA (under construction)

Best Practice Workspaces: Controlled Access (CA)

  • Broad_MutationCalling_MuTect_Workflow_BestPractice_CA (under construction)

  • Broad_MutationCalling_CopyNumber_Workflow_BestPractice_CA (under construction)

  • Broad_MutationCalling_Filtering_Workflow_BestPractice_CA

If you have access to a FireCloud Billing Project, you can clone any of these workspaces and launch an analysis through the Method Configurations tab.

Note: all methods that contain ‘_BETA’ in the name are still undergoing active development.

Please refer to the How to Run this Module in FireCloud sections below for more information.

Broad Mutation Calling QC Workflow

Runtime

The total expected runtime for this workflow depends on the size of the pair or pair set you select for analysis. Pair HCC1954_100_gene_pair runs on "tiny" 100-gene BAMs and the expected runtime is roughly 15 minutes. Pair HCC1143_WE_pair runs on Whole Exome BAMs and the expected runtime is roughly 2.5 hours.

QC Task

The QC task counts reads overlapping regions for tumor and normal BAM files. The task concludes with a report of the counts over the BAMs and lanes. Correlation values are included for comparison purposes.

ContEST Task

ContEst uses a Bayesian approach to calculate the posterior probability of the contamination level and determine the maximum a posteriori probability (MAP) estimate of the contamination level.

ContEst supports an array-free mode, in which genotypes are called on the fly from the matched normal and used as the source of homozygous variant calls. It currently calls any site with at least 50X coverage where more than 80% of bases support the alternate allele as a homozygous-alternate site.
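
The site selection rule described above can be sketched as a simple predicate (illustrative only; the function name and defaults here are not ContEst's actual code):

```python
# Hypothetical sketch of ContEst's array-free site selection rule: a site is
# taken as homozygous-alternate when >80% of bases support the alternate
# allele and coverage is at least 50X. Names and defaults are illustrative.

def is_hom_alt_site(alt_count: int, total_depth: int,
                    min_depth: int = 50, min_alt_fraction: float = 0.8) -> bool:
    """Return True if the site qualifies as a homozygous-alternate call."""
    if total_depth < min_depth:
        return False
    return alt_count / total_depth > min_alt_fraction
```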

Picard Metrics Tasks

Picard Metrics Tasks invoke multiple metrics reporting routines from the Picard toolkit. The following routines are invoked:

  • CollectAlignmentSummaryMetrics

  • CollectInsertSizeMetrics

  • QualityScoreDistribution

  • MeanQualityByCycle

  • CollectBaseDistributionByCycle

  • CollectSequencingArtifactMetrics

  • CollectQualityYieldMetrics

  • CollectOxoGMetrics

SAM/BAM validation routines (ValidateSamFile) are called in both summary and verbose mode. Other routines, such as CollectHsMetrics and MarkDuplicates, are also called.

Nozzle Report

Analysis results from QC, ContEst, and Picard Metrics are populated to HTML reports that you can view through the workspace Data tab. This report structures the results into tables.

Inputs and Outputs

Below are the tool-specific inputs and outputs for this workflow. Note that these inputs and outputs may vary from the “standard” version of the tool.

Inputs

  • File picardHapMap

  • File DB_SNP_VCF

  • File HaplotypeDBForCrossCheck

  • File tumorBamIdx

  • String tumorBamGB

  • File picardTargetIntervals

  • String pairName

  • File normalBam

  • String pairBamGB

  • File refFastaDict

  • String normalBamGB

  • File captureNormalsDBRCLZip

  • File refFasta

  • File readGroupBlackList

  • File tumorBam

  • File SNP6Bed

  • String validationStringencyLevel

  • File exomeIntervals

  • File refFastaIdx

  • File normalBamIdx

  • File picardBaitIntervals

  • File regionFile

  • File DB_SNP_VCF_IDX

  • Integer preemptibleVal

  • File HapMapVCF

Outputs

  • String fraction_contamination

  • File crossCheckStats

  • File crossCheckReport

  • File metricsReportsZip

  • File tumor_validation

  • File hsMetrics

  • File normal_validation

  • File normalhsMetrics

  • File QC_nozzle_report

  • File copy_num_report_html

How to Run this Module in FireCloud

  • Clone the Broad_MutationCalling_QC_Workflow_BestPractice workspace to run this workflow.

  • In your cloned workspace, navigate to the Method Configurations tab and click on the method config, MutationCalling_QC_v1-1_BETA_cfg.

  • Click Launch Analysis.

  • In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143_WE_pair. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.

  • Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.

  • You can view results and outputs by clicking on the most recent analysis run, e.g., HCC1143_WE_pair.

  • Click on Outputs: Show, then select output files to view the results of this analysis.

Broad Mutation Calling MuTect Workflow

The Broad Mutation Calling MuTect workflow runs on a tumor and normal pair or pair set, and consists of three main steps: 1) preparation, 2) somatic mutation analysis, and 3) somatic mutation annotation. As a part of this workflow, MuTect is also run in force-call mode to search for somatic variants at a set of specified loci for clinical relevance (for example, TP53 gene on chromosome 17).

The three main steps are described here:
1. Preparation: The first step prepares input parameters, including the Intervals file, for analysis by MuTect1 and MuTect2. This step may be known as “scatter preparation.”

2. Somatic Mutation Analysis: Two somatic variant analyses are carried out in parallel: MuTect1 and MuTect2. MuTect1 is used for its somatic single-nucleotide variant detection capabilities; MuTect2 is used for its indel detection capabilities. Both are run in parallel over the Intervals file. This step may be known as “scatter”.

3. Somatic Mutation Gather: This final step annotates the variants found in the previous step. First, the scattered results are gathered and merged. Then the indel variants from the MuTect2 output are separated out and merged with the single-nucleotide variants from MuTect1. Next, only variants tagged with PASS are fed to Oncotator for annotation. Finally, all variants are passed to the VEP (Variant Effect Predictor) program for further annotation.
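
The gather logic above can be sketched roughly as follows (a toy illustration with simplified variant records, not the pipeline's actual WDL or code):

```python
# Illustrative sketch of the gather step: keep SNVs from MuTect1 and indels
# from MuTect2, retain only PASS calls, and merge into one sorted call set.
# Variant records are simplified dicts, not real VCF records.

def is_indel(variant: dict) -> bool:
    return len(variant["ref"]) != len(variant["alt"])

def gather_calls(mutect1_calls: list, mutect2_calls: list) -> list:
    snvs = [v for v in mutect1_calls
            if v["filter"] == "PASS" and not is_indel(v)]
    indels = [v for v in mutect2_calls
              if v["filter"] == "PASS" and is_indel(v)]
    return sorted(snvs + indels, key=lambda v: (v["chrom"], v["pos"]))
```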

Runtime

The total expected runtime for the Broad Mutation Calling MuTect workflow depends on the size of the pair you select for analysis. Pair HCC1954_100_gene_pair runs on "tiny" 100 gene BAMs and the expected runtime is roughly 45 minutes. Pair HCC1143_WE_pair runs on Whole Exome BAMs and the expected runtime is roughly 1.5 hours.

MuTect1

MuTect1 is the original DREAM challenge-winning somatic point mutation caller (Cibulskis et al., 2013). It identifies somatic point mutations in next-generation sequencing data of cancer genomes. It takes as input sequencing data for matched normal and tumor tissue samples, and outputs mutation calls and optional coverage results.

In a nutshell, the analysis itself consists of three steps:

  • Pre-process the aligned reads in the tumor and normal sequencing data

  • Identify, using statistical analysis and a Panel of Normals, sites that are likely to carry somatic mutations with high confidence (see the Panel of Normals section for more information)

  • Post-processing of candidate somatic mutations

For complete details, please see the 2013 publication in Nature Biotechnology.

MuTect2

MuTect2 is a somatic SNP and indel caller that combines the original MuTect (Cibulskis et al., 2013) with the assembly-based machinery of HaplotypeCaller.

The basic operation of MuTect2 proceeds similarly to that of the HaplotypeCaller.
While the HaplotypeCaller relies on a ploidy assumption (diploid by default) to inform its genotype likelihood and variant quality calculations, MuTect2 allows for a varying allelic fraction for each variant, as is often seen in tumors with purity less than 100%, multiple subclones, and/or copy number variation (either local or aneuploidy). MuTect2 also applies some hard filters to variants before producing output.
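The difference from a fixed-ploidy model can be illustrated with a toy tumor-LOD calculation that lets the allele fraction float to its maximum-likelihood value (a simplified sketch under an assumed per-base error rate, not GATK's actual likelihood model):

```python
import math

# Toy illustration (not GATK code) of why a free allele fraction matters:
# compare the likelihood of the read data under the best-fit tumor allele
# fraction against a no-variant model with only sequencing error. The error
# rate `err` is an assumed constant, not a documented default.

def log10_likelihood(alt: int, ref: int, f: float, err: float = 1e-3) -> float:
    """log10 P(data | allele fraction f), treating reads as independent."""
    p_alt = f * (1.0 - err) + (1.0 - f) * err  # prob a read shows the alt base
    return alt * math.log10(p_alt) + ref * math.log10(1.0 - p_alt)

def tumor_lod(alt: int, ref: int) -> float:
    f_hat = alt / (alt + ref)                  # maximum-likelihood fraction
    return log10_likelihood(alt, ref, f_hat) - log10_likelihood(alt, ref, 0.0)
```

A site with 10 alt reads out of 100 (a 10% subclonal fraction a diploid model would struggle with) still yields a strongly positive LOD under this free-fraction model.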

Oncotator

Oncotator is a tool for annotating information onto genomic point mutations (SNPs/SNVs) and indels. It is primarily used for human genome variant callsets. However, the tool can also be used to annotate any kind of information onto variant callsets from any organism.

By default, Oncotator uses a simple TSV file (e.g., MAFLITE) as an input and produces a TCGA-format MAF as an output. Oncotator also supports VCF files as an input and output format. By extension, Oncotator can also be configured to report annotated genomic point mutation data as HTML, as it does in this workflow.

VEP (Variant Effect Predictor)

Ensembl’s VEP (Variant Effect Predictor) program processes variants for further annotation. This tool annotates variants, determines the effect on relevant transcripts and proteins, and predicts the functional consequences of variants.

VEP accepts as input the coordinates of identified alleles, or variant identifiers. A full list of input formats is available here. If an input variant causes a change in the protein sequence, VEP will calculate the possible amino acids at that position and assign the variant a consequence type of missense.
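
The missense determination can be illustrated with a minimal codon comparison (illustrative only; the codon table here is truncated to just the entries used in the example):

```python
# Minimal sketch of how a coding change is classed: translate the reference
# and variant codons and compare amino acids. A real annotator uses the full
# 64-entry codon table and transcript context; this table is truncated.

CODON_TABLE = {"GAA": "E", "AAA": "K", "GAG": "E", "TAA": "*"}

def consequence(ref_codon: str, alt_codon: str) -> str:
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"    # same amino acid, no protein change
    if alt_aa == "*":
        return "stop_gained"   # premature stop codon
    return "missense"          # different amino acid at this position
```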

Nozzle Report

The final step of the workflow combines the analysis results from MuTect1, MuTect2, Oncotator, and VEP into a single HTML report. Our method configuration instructs FireCloud to write a link to the report back to the workspace data model, allowing you to view the report from the workspace Data tab. This report structures filtered results and annotations into tables.

Inputs and Outputs

Inputs

  • File tumorBam

  • File tumorBamIdx

  • File normalBam

  • File normalBamIdx

  • File refFastaIdx

  • File mutectIntervals

  • File mutectForcecallIntervals

  • File refFasta

  • File refFastaDict

  • String fracContam

  • File dbSNPVCF

  • File cosmicVCF

  • String downsampleToCoverage

  • File readGroupBlackList

  • File normalPanel

  • File oncoDBTarBall

  • File VEP_File

  • Int nWay

  • String mutectGBDiskVal

  • String oncoVEPDiskVal

  • String pairName

Outputs

  • File Task.mutectfc_cs

  • File mutectfc_pw

  • File mutectfc_cw

  • File preannotated_vcf

  • File oncotator_log

  • File mutect1Merged

  • File mutect2Merged

  • File VEP_VCF

  • File VEP_Output

  • File VEP_Report

  • File nozzleReport

  • File nozzleRData

How to Run this Module in FireCloud

  • Clone the Broad_MutationCalling_MuTect_Workflow_BestPractice workspace to run this workflow.

  • In your cloned workspace, navigate to the Method Configurations tab and click on the method config, MutationCalling_Mutect_v1-2_BETA_cfg.

  • Click Launch Analysis.

  • In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143_WE_pair. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.

  • Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.

  • When the status displays Done, select the most recent analysis run, e.g., HCC1143_WE_pair.

  • Click on Outputs: Show, then select output files to view the results of this analysis.

Broad Mutation Calling Copy Number Workflow

The Copy Number workflow uses two types of information from sequencing data to detect copy number variations (CNVs). First, targets (usually exons, but any genomic locus) with abnormally high or low coverage suggest amplifications or deletions, respectively. Second, sites that are heterozygous in a normal sample and have allele ratios significantly different from 1:1 in the matched tumor sample imply a CNV event involving one or both alleles.

The workflow is correspondingly split into two major portions:

1. GATK CNV: Using coverage data that has been normalized against a Panel of Normals (PoN) to remove sequencing noise, targets are partitioned into segments that represent the same copy-number event (see the Panel of Normals section for more information). In GATK CNV, segmentation is performed by a circular binary segmentation (CBS) algorithm, originally developed to segment noisy array copy-number data. Amplifications, deletions, and copy-neutral regions are then called from the segmentation.

2. GATK ACNV: Heterozygous sites are identified in the normal sample and segmented, again using CBS, according to their ref:alt allele ratios in the tumor sample. These allele-fraction segments are combined with the copy-ratio segments found by GATK CNV to form a common set of segments. Modeling of both the copy ratio and minor allele fraction of each segment is alternated with the merging of adjacent segments that are sufficiently similar according to this model, until convergence.
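
The two signals described above can be sketched in miniature (schematic formulas only, not GATK's implementation; the normalization against a PoN median is a simplification):

```python
import math

# Schematic illustration of the two CNV signals: a log2 copy ratio from tumor
# coverage normalized against a panel-of-normals expectation, and a minor
# allele fraction at a heterozygous site. Real tangent normalization projects
# against PoN principal components; a simple median stands in for it here.

def log2_copy_ratio(tumor_coverage: float, pon_median_coverage: float) -> float:
    """log2 of tumor coverage relative to the PoN expectation for a target."""
    return math.log2(tumor_coverage / pon_median_coverage)

def minor_allele_fraction(ref_count: int, alt_count: int) -> float:
    """Fraction of the less-frequent allele; 0.5 means a balanced het site."""
    return min(ref_count, alt_count) / (ref_count + alt_count)
```

A target at twice the PoN coverage gives a log2 ratio of 1.0 (one extra copy doubling), and a het site skewed to 30:70 gives a minor allele fraction of 0.3, both suggestive of a CNV event.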

Steps

1. Coverage Collection

2. Process Panel of Normals

3. Segmentation by Tangent-normalized Coverage

4. Calling of Events from Coverage Segments

5. Collection of Allele Counts at HET Sites

6. Segmentation by Minor Allele Fraction

7. Union of Copy-ratio and Minor-allele-fraction Segments

8. Small-segment Merging

9. Model Fitting

10. Similar-segment Merging
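
Step 10, similar-segment merging, can be illustrated with a toy routine that repeatedly merges adjacent segments whose mean copy ratios are within a tolerance (the threshold, data layout, and weighted-mean rule here are assumptions for illustration, not the workflow's actual model-based criterion):

```python
# Toy sketch of similar-segment merging: repeatedly merge adjacent segments
# whose mean copy ratios differ by less than a tolerance, until no adjacent
# pair is that close. Segments are (mean_copy_ratio, n_targets) tuples.

def merge_similar_segments(segments: list, tol: float = 0.1) -> list:
    segs = list(segments)
    merged = True
    while merged:
        merged = False
        for i in range(len(segs) - 1):
            (m1, n1), (m2, n2) = segs[i], segs[i + 1]
            if abs(m1 - m2) < tol:
                # replace the pair with their target-count-weighted mean
                segs[i:i + 2] = [((m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2)]
                merged = True
                break
    return segs
```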

Runtime

The total expected runtime for the Broad Mutation Calling Copy Number workflow depends on the size of the pair you select for analysis. Pair HCC1954_100_gene_pair runs on "tiny" 100 gene BAMs and the expected runtime is roughly 45 minutes. Pair HCC1143_WE_pair runs on Whole Exome BAMs and the expected runtime is roughly 1.5 hours.

Inputs and Outputs

Inputs

  • File recapseg_bed

  • File ref_fasta

  • File ref_fasta_dict

  • File ref_fasta_fai

  • File common_snp_list

  • File tumor_bam

  • File tumor_bam_idx

  • File normal_bam

  • File normal_bam_idx

  • File PoN

  • String Disk_GB

  • String entity_id

  • Boolean padTargets

  • Int PreemValue

Outputs

  • File gatk_cnv_coverage_file

  • File gatk_cnv_seg_file

  • File gatk_cnv_tn_coverage

  • File gatk_cnv_pre_tn_coverage

  • File gatk_het_ad_normal

  • File gatk_het_ad_tumor

  • File gatk_acnv_seg_file

How to Run this Module in FireCloud

  • Clone the Broad_MutationCalling_CN_Workflow_BestPractice workspace to run this workflow.

  • In your cloned workspace, navigate to the Method Configurations tab and click on the method
    config, MutationCalling_CN_v1-0_BETA_cfg.

  • Click Launch Analysis.

  • In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow, e.g., HCC1143_WE_pair. You can also run this workflow on a pair set. To do so, toggle to pair_set, select HCC_pairs and enter this.pairs in the Define Expression field.

  • Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.

  • You can view results and outputs by clicking on the most recent analysis run, e.g., HCC1143_WE_pair.

  • Click on Outputs: Show, then select output files to view the results of this analysis.

Broad Mutation Calling Filtering Workflow

The Broad Mutation Calling Filtering workflow filters out somatic mutations that are more consistent with artifacts or germline mutations. This Filtering workflow takes as an input a VCF file generated by the Broad Mutation Calling MuTect workflow. Note that we provide this file as a workspace attribute, onco_no_ann_out, for the pre-loaded pair HCC1143_WE_pair.

Before the filtering tasks are run, the VCF file is converted to a MAF file with Oncotator, and initial PreAdapterDetailMetrics are collected from the BAM file using a customized Picard command. The VCF-to-MAF conversion includes some minimal annotation to add genomic-context information. Then, the MAF PoN Filter task builds a likelihood model based on sequencing conditions in a Panel of Normals (PoN) to filter out somatic mutations. The FFPE Filter and OxoG Filter tasks screen out Formalin-Fixed, Paraffin-Embedded (FFPE) and oxoguanine (OxoG) artifacts, respectively. Finally, a MAF Merge task combines all filtered results into a single VCF file, and the Oncotation task runs Oncotator to annotate that VCF file.

VCF to MAF Converter

The MAF PoN Filter task takes as a primary input a MAF file. In order to accommodate the VCF outputs from the MuTect workflow, this task converts a VCF file to MAF format, using the Oncotator tool.

Generate Metrics

The Generate Metrics task collects statistics on the BAM files that are needed for the subsequent filtering tasks.

MAF PoN Filter

The MAF PoN Filter uses a likelihood model to compare somatic mutations against a Panel of Normals (PoN) in order to screen out somatic mutations. The PoN represents sequencing conditions in the case sample, including germline variants and technical artifacts. Refer to the Panel of Normals section for more information.

FFPE Filter

FFPE introduces multiple types of DNA damage, including deamination, which converts cytosine to uracil and leads to downstream mispairing in PCR: C>T / G>A. Because deamination occurs prior to ligation of the palindromic Illumina adapters, likely deamination artifacts have a read-orientation bias. The FFPE Filter task uses this read orientation to identify artifacts and calculates a Phred-scaled Q-score for FFPE artifacts.
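
The orientation-bias idea can be sketched as follows (purely illustrative; the filter's actual statistical model is more involved, and the function names here are assumptions):

```python
import math

# Small sketch of the orientation-bias signal: deamination artifacts
# (C>T / G>A) show up predominantly on one read orientation, while true
# variants appear on both roughly equally. An artifact probability can be
# expressed on the Phred scale: Q = -10 * log10(p).

def orientation_bias_fraction(f1r2_alt: int, f2r1_alt: int) -> float:
    """Fraction of alt reads in the F1R2 orientation; ~0.5 means no bias."""
    return f1r2_alt / (f1r2_alt + f2r1_alt)

def phred_scaled(p_artifact: float) -> float:
    """Phred-scale a probability: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p_artifact)
```

For example, 9 of 10 alt reads on one orientation (fraction 0.9) is a strong bias signal, and an artifact probability of 0.001 corresponds to Q30.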

OxoG Filter

Oxidation of guanine to 8-oxoguanine is one of the most common pre-adapter artifacts associated with genomic library preparation, arising from a combination of heat, shearing, and metal contaminants in a sample. The 8-oxoguanine base can pair with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification. The OxoG Filter task detects and screens out these artifacts.

MAF Merge

After the filtering steps are completed, the MAF Merge task takes the filtered variants from the three output MAF files (OxoG Filter, FFPE Filter and MAF PoN Filter), and adds them to the original VCF file.

Oncotation

The Oncotation task adds annotations to the VCF File using the Oncotator tool.

Runtime

The total expected runtime for the Broad Mutation Calling Filtering workflow depends on the size of the pair you select for analysis. Pair HCC1143_WE_pair runs on Whole Exome BAMs and the expected runtime is roughly 2.5 hours.

Inputs and Outputs

Inputs

  • File DBSNP

  • String oxoGOBF ALTALLELE

  • Integer CODING ONLY

  • File oncotation TarBall

  • String ffpeOBF ALTALLELE

  • String PairID

  • File Generate Metrics

  • File PoNFile

  • Float WCUT

  • Float thresh

  • String oxoGOBF Context

  • File DBSNPIDX

  • File tumor BAM

  • String NMIN

  • File cytoBandFile

  • String ffpeOBF Context

  • String TOTNStr

  • File REFERENCE

  • File oncoDBTarBall

  • String MIN_ALT_COUNT

  • String ffpe OBF Stub

  • File inVCF

  • String TUM ID

  • File tumor BAM IDX

  • String oxoGOBF STUB

Outputs

  • File Generate Metrics

  • File MAF PoN Filter Zip

  • File oxoG MatLab And Filter Data Zip

  • File ffpe MatLab And Filter Data Zip

  • File Merged Filtered VCF

  • File Annotated Filtered VCF

How to Run this Module in FireCloud

  • Clone the Broad_MutationCalling_Filter_Workflow_BestPractice_CA workspace to run this workflow. Note that this is a Controlled Access workspace because the PoN contains identifiable data.
  • In your cloned workspace, navigate to the Method Configurations tab and click on the method, MutationCalling_Filter_v1-1_BETA.
  • Click Launch Analysis.
  • In the Launch Analysis window, toggle to pair and select a pair on which to run this workflow. The preloaded pair is HCC1143_WE_pair.
  • Finally, click the Launch button. Check back on the Monitor tab later to view results from your workflow analysis.
  • You can view results and outputs by clicking on the most recent analysis run, e.g., HCC1143_WE_pair.
  • Click on Outputs: Show, then select output files to view the results of this analysis.

Panel of Normals

The Panel of Normals (PoN) plays two important roles in the somatic mutation and copy number variant analysis and filtering workflows: (1) it excludes germline mutation or variant sites found in the normals, to avoid calling them as potential mutations or variants in the tumor; and (2) it excludes technical artifacts that arise from particular techniques (e.g., sample preservation) and technologies (e.g., library capture, sequencing chemistry). An effective PoN consists of normals that are technically similar to and representative of the tumor samples, and sequenced on the same platform. Three of the four Broad Mutation Calling modules (MuTect, Copy Number, and Filtering) require use of PoNs.
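
The PoN's first role can be sketched as a simple site-blocklist filter (an illustrative toy, not the workflows' actual likelihood-based filtering; the min_hits threshold is an assumption):

```python
# Simplified sketch of the PoN's first role: drop candidate somatic calls at
# sites seen in some minimum number of panel normals, since those are likely
# germline variants or recurrent technical artifacts. The min_hits default
# is illustrative, not a documented workflow parameter.

def filter_with_pon(candidates: list, pon_site_counts: dict,
                    min_hits: int = 2) -> list:
    """Keep candidates whose (chrom, pos, alt) is rare in the panel."""
    return [v for v in candidates
            if pon_site_counts.get((v["chrom"], v["pos"], v["alt"]), 0) < min_hits]
```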

As the TCGA PoNs contain identifiable data and are therefore considered Controlled Access, we will provide both Open Access and Controlled Access Best Practice Workspaces containing, respectively, Open and Controlled Access PoNs. PoNs are referenced in the workspace as Workspace Attributes.

Please note that the Open Access PoNs were created from a limited number of low-coverage 1000 Genomes samples. As such, these Open Access PoNs will not deliver adequate filtering results and will have a higher rate of false positives than well-matched PoNs. The TCGA Controlled Access PoNs are a well-curated set of approximately 1000 TCGA normals.

We recommend, when importing your own data to FireCloud, that you build your own PoN using instructions provided within the GATK forum (note that in the near future this will switch over to the GATK Best Practices somatic SNVs and indels workflow and somatic CNVs workflow pages). When constructing a PoN, it is very important to use normals that are technically similar to and representative of your tumor samples, as well as sequenced on the same platform.

Open Access PoN

Open Access PoNs are publicly available for MuTect and Copy Number.

  • MuTect: refseq_exome_10bp_hg19_300_1kg_normal_panel.vcf

  • Copy Number: open_access_pon_from_1000g.pon

These Open Access PoNs comprise a limited number of consented whole exome samples from the 1000 Genomes project: 300 samples for the MuTect PoN and 20 samples for the Copy Number PoN.

Controlled Access PoN

Users with dbGaP approval for controlled access TCGA data may access and use the Controlled Access PoNs, e.g., Filtering: final_summed_tokens.hist.bin for the Filtering workflow. The Controlled Access PoNs comprise approximately 1000 carefully curated TCGA normals sequenced with Agilent capture, spanning multiple TCGA cohorts (excluding blood cancers). This results in fewer false positives than with the Open Access PoNs.
For more information about how to gain dbGaP approval for controlled access data, go here.

References

Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology 31, 213–219 (2013).

Comments

  • bruce01 Member

    Hi, a question on the MuTect2 aspect of the workflow: you have "fracContam" as an input. Where is this generated (ContEst?), and can you give a little information on any filtering of the output used to make this file?

    Thanks, Bruce.

  • jneff Boston Member, Broadie, Moderator admin
    edited June 2016

    Hi Bruce,

    If you were to clone the Broad_MutationCalling_QC_Workflow_V1_BestPractice workspace and run an analysis, the primary ContEST output would be the fracContam estimate, which I believe is 0.02 or 2% for the pre-loaded data within that workspace.

    Within the separate Broad_MutationCalling_MuTect_Workflow_V1_BestPractice workspace, you'll find fracContam as an input string (number) that you can edit at runtime within the Method Configuration. The default input is 0.02, I believe because the two workspaces are referring to the same pre-loaded data.

    Thanks,
    Jason

  • malaveh UCSF Member

    Would you be able to give me a quick summary of how to interpret the outputs for this tutorial, as well as which output files are most relevant/informative when determining mutations? I am especially interested in understanding the MuTect1 and MuTect2 Outputs, Oncotator and VEP Outputs, and QC Outputs.

    Thanks,
    Heydi

  • jneff Boston Member, Broadie, Moderator admin

    Hi Heydi,

    Please stay tuned as we plan to roll out new functionality and documentation for Mutation Calling workflows very soon. The new functionality will include Nozzle reports that will allow you to view the outputs without downloading files.

    Thanks,
    Jason

  • Tsunami China Member

    Hi @jneff,

    I'm excited about the somatic mutation calling best practice. Here is my question: how could this pipeline be applied to non-cancer somatic mutation analysis? My project compares mutations in the same tissue between patients and healthy controls, so I can't get a paired case and control sample from the same individual as in cancer analysis. Could you give me some advice? Thanks!

  • jneff Boston Member, Broadie, Moderator admin
    edited September 2016

    Hi Tsunami,

    The Broad Mutation Calling Best Practice workflows (QC, MuTect, Copy Number, and Filtering) run on pairs of tumor and normal BAM files for the same participant. Making the workflows runnable for your project may require more effort than simply changing the inputs and method configurations. There would likely be significant changes needed to the WDL and/or code for the tools they run.

    In that case, I would recommend checking out PLINK and see what is available through GATK. If any tools become available on FireCloud that may suit your project, we can let you know.

    -Jason

  • Eugenie Member

    Hello,
    I have a question concerning Mutect workflow performance.
    1.5h per WES pair sounds great, but how many resources does it use (memory requirements, parallelization)?
    I don't have access to FireCloud yet, so I am currently testing on our local cluster to see if it would be feasible for big datasets.
    And with -Xmx16g, MuTect calling takes at least 10h per pair.
    I would greatly appreciate any comments on how to improve the performance.
    Thank you!

    Best,

    Eugenie

  • jneff Boston Member, Broadie, Moderator admin

    Hi Eugenie,

    Are you running MuTect locally using Cromwell? Running MuTect in FireCloud would allow you to set runtime parameters (Method Config inputs) that can improve performance. For example, you can use the nWay parameter to adjust parallelism.

    Our suggestion would be to test out different runtime parameters on a single pair in FireCloud. It's free to register for FireCloud, though you would need access to a Billing Project to clone a workspace and run analyses.

    Thanks,
    Jason

  • Eugenie Member

    @jneff

    Hi Jason,
    Thank you for the reply.
    We will probably get access to FireCloud at some point, but I don't think it will help with this particular dataset, as the patient data has to be analysed locally.
    Looking forward to testing FireCloud though.

    Best,

    Eugenie

  • YingLiu China Member

    @jneff
    How can I download these four workflows? I want to run them on my local machine so that I can study the source code and then run the pipeline locally.

  • How can I download these four workflows (Broad Mutation Calling)? I would like to run this pipeline at my local machine. My email: [email protected]

  • Hi, can I run my own samples (tumor-normal pairs) on FireCloud workflows?
    Thanks,

  • birger Member, Broadie, CGA-mod ✭✭✭

    @mattoslmp
    When you say you want to run the workflow on a local machine, do you mean you want to run it using a local version of Cromwell? You can obtain the WDL directly from FireCloud. When viewing the workflows in the Method Repository, the WDL tab provides a display of the workflow's WDL; you can copy and paste that WDL into a local file and then use it as input to your Cromwell command line.

  • MattB Newcastle Member

    The Controlled Access workspaces PoN link above (http://gatkforums.broadinstitute.org/firecloud/discussion/7521/controlled-access-data-in-firecloud#latest) is not found. Is this data still available via dbGaP? Could you please provide the details?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @MattB, it looks like that doc got deleted when we recently reorganized some content. Let me find out what replaced it. I believe the data is indeed still available, perhaps @birger can provide some information while I hunt down the link to update this doc.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
  • MattB Newcastle Member

    Thanks for the link, I will follow those guides. Though I'm not sure how the shutdown will affect the eRA Commons / NIH staff; I note the disclaimer on the NCBI home page.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Ah yes since that’s not an automated process afaik I don’t think anything will happen on that front until the shutdown is over. Somewhere in space a Vulcan ethnographer is thinking wow, this species really raises the bar on political dysfunction.

  • Hi @birger, I tried to download the image (sudo docker pull broadinstitute/cromwell, from registry hub.docker.com/r/broadinstitute/cromwell), but the following message is shown: Error response from daemon: manifest for broadinstitute/cromwell:latest not found. Is there another place to get this image?

  • dheiman Member, Broadie ✭✭

    @mattoslmp cromwell can be found on Github: https://github.com/broadinstitute/cromwell. The command line tool can be found on the release section of that page.

  • MattB Newcastle Member

    I'm using FireCloud, but I can't now find any reference to the Broad_MutationCalling_MuTect_Workflow_BestPractice_CA workflow described here. I do, however, have a clone of the mutect2-gatk4 method (has this replaced it?). How do I go about using the controlled access TCGA PoN (I've now been granted access to this)? I can see pon and pon_index settings in the method configurator, and my workspace has TCGA authorisation.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Yes, the GATK4 Mutect2 workflow replaces the previous workflow. You should be able to enter the gs:// path to the PoN VCF file you were given access to in the pon field of the method config, and the same for its index.

    When you say you have a clone of the method, do you mean the method itself or a Workspace with the method config already set up? The latter should be available in the Featured Workspaces. It makes it a lot easier to get started. Let us know if you need more guidance.

  • MattB Newcastle Member

    Ah OK, I'm attempting to run on my own data, so I'm not 100% sure if cloning the existing workspace works best. Can I just upload my own data into the cloned one, amending the various metadata files and deleting the existing data? Or can I clone a method along with its existing configuration into a new workspace (which is what I was trying to do)? Not sure which works best.

    So I've now cloned the Mutect2 (must try not to capitalise the T.. :smile: ) workspace. However, one problem I have is that the name of said PoN (as a file) was previously stated somewhere in the documentation (possibly the WDL comments from the older WDL referenced above), and I've now no idea how to find it or what bucket it now resides in! Basically, I read this document a few weeks back, applied for TCGA access, and have now come back, but I can't quite connect what I've read here about controlled-access workspaces and PoNs to the new Somatic-SNVs-Indels-GATK4 workspace.

    Also, if I have a set of my own normal genomes, is there a WDL method for creating a PoN of my own? A quick search in FireCloud has not revealed one. I'm assuming the TCGA PoN is less optimal for my use case than my own would be, but perhaps still better than no PoN at all?

    Also, from looking at the Workspace Attributes baked into my cloned workspace, I assume this is all set up for b37, so I'll need b37-aligned, analysis-ready BAMs of my own to use this workspace, unless I swap them out for these ones here: https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38/?project=broad-dsde-outreach&pli=1 However, my recollection was that the TCGA PoN was only b37...

    UPDATE: I have just located this bunch of files: https://gatkforums.broadinstitute.org/firecloud/discussion/8571/tcga-reference-files#latest. I assume the controlled-access version of the somatic calling stated here previously used final_summed_tokens.hist.bin only, because the rest of the PoN-like things are open access. Or am I conflating the PoN for calling with the data for FilterMutectCalls? I note there is also a file, 10bp_hg19_300_1kg_normal_panel.vcf, but again I'm unsure, as that one is not restricted the way this document makes out the PoN should be.

    Apologies, I realise this post now crosses several different categories of troubleshooting all at once! ;)

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @MattB,

    Cloning the existing workspace is the most convenient way to try out the workflow on an example that we know works and has a fully functional configuration all set up. Then you can add a few samples from your own data to run them through and satisfy yourself that everything works as expected. From there, you can choose whether to continue in that workspace (potentially after deleting the example data) or to create a new workspace and import the config there. The former is possible (but you have to use the API to delete data entities); the latter is cleaner and is what I would do.

    Regarding what is the right file to use as a PoN: in general you're looking for a VCF, and if you're working with TCGA data there may be an appropriate PoN available to you, but that is outside of my scope -- @birger or @eddieasalinas might be able to point you to the relevant resource files.

    To your point about making your own PON, yes that is generally recommended. We haven't set up the M2 PON creation WDL in FC yet due to bandwidth limitations, but the WDL is available in the GATK4 github repo -- let me know if you need a pointer to the specific file. We plan to make it easier to access once we've addressed the most pressing needs.
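    A rough sketch of that recipe, for anyone following along: run Mutect2 in tumor-only mode on each normal, then combine the per-normal calls. File names and sample names below are placeholders, and the exact flags vary by GATK4 version, so check the tool docs against your installed version.

    ```shell
    # Step 1: call each normal in tumor-only mode (repeat per normal BAM;
    # the sample name must match the SM tag in the BAM's read groups)
    gatk Mutect2 -R ref.fasta -I normal1.bam -tumor normal1_sample -O normal1.vcf.gz
    gatk Mutect2 -R ref.fasta -I normal2.bam -tumor normal2_sample -O normal2.vcf.gz

    # Step 2: combine the per-normal call sets into a panel of normals
    gatk CreateSomaticPanelOfNormals -vcfs normal1.vcf.gz -vcfs normal2.vcf.gz -O pon.vcf.gz
    ```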

    Genome reference builds are the bane of my existence. Traditionally at Broad everything has been done in b37 (often under the name hg19, but not really, or with some hackiness in a hybrid fashion); but we are moving everything to hg38, slowly but surely. The bucket you found is a preliminary pass at making hg38 somatic resources, though. There might be some caveats...

  • MattB Newcastle Member
    edited February 22

    Hi @Geraldine_VdAuwera,

    Thanks for the tip re ways of cloning workspaces / methods - I was unsure as to what was the best way to go. I'll give the existing one a test with a pair of my own added to it and then attempt to clone the methods and configs in to a new workspace once I know that works.

    From what I understand, whilst I can export a method and its config to a new workspace using the GUI, things like workspace.gnomad won't be populated in a new (uncloned) workspace, but I can export these attributes via a .tsv (from a working setup) and import them to my new workspace? It might be good to have an import-from-existing-workspace option without the file download/upload round trip.
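    (If it helps others: my understanding is that the attributes TSV is a two-row file, a header row of attribute names and a row of values. The attribute names and paths below are made up for illustration; the exact header convention is whatever the working workspace's export produces.)

    ```
    workspace:gnomad	ref_fasta
    gs://some-bucket/gnomad.vcf	gs://some-bucket/ref.fasta
    ```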

    As regards the pre-processing of pairs: looking over the help-gatk/Pre-processing_b37_v3 workspace, there are many attributes here which are not going to be present and/or are different (such as the docker image) from those in the help-gatk/Somatic-SNVs-Indels-GATK4 workspace. Consequently, for complete processing from my own unaligned BAMs, I'm assuming I can clone the methods and import the attributes from the featured workspaces, but pre-processing and the paired somatic calling should ideally be done in separate workspaces? Can I then point the second M2 workspace at the output of the first pre-processing one by referencing the GS bucket address of the output somewhere in the pair metadata? I'm just trying to get my head around how to perform analysis on my own data using FC in the most pain-free way!

    Re calling on TCGA data: I'm still not 100% sure where the controlled-access PoN now resides, GS-wise, to use as a file location for my M2 config, so if anyone knows, please let me know! Yes, I totally understand it's only really going to be optimal for TCGA-produced data.

    With regard to M2 PoN creation in WDL: by "bandwidth" I assume you mean human bandwidth here, or is there some issue with regard to the runtime/IO of doing that in FC?

    Wholeheartedly agree that s/chr//-related issues have been the bane of the community for as long as I can remember! Yes, I had assumed the liftOver procedure was far from 100% perfect in the first instance.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hey @MattB , sorry for the long lag. Presumably you've found out everything you needed by now but let me answer your questions for the record.

    1. Workspace attributes: it's admittedly weird and dumb that they're so separate from the rest of the data/metadata. There are plans to move them to the Data section and make it possible to treat them like other pieces of data, including importing from another workspace. In the meantime, you can dump to TSV and re-upload, yes.

    2. The separation of the pre-processing and somatic workspaces is largely an artifact of how we make the methods available plus some history of the somatic methods development. In principle you could consolidate everything into a single workspace and run everything there; in fact that's probably what I would recommend doing, to keep all the work on a project in the same place. On the germline side we have the preprocessing pipeline set up alongside the variant calling already. But that being said if you choose to keep them separate you can absolutely reference the data from the other workspace.
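    (To sketch what cross-workspace referencing looks like: when you load entities from a TSV, the file columns can simply point at objects in the other workspace's bucket. Everything below, the sample ID, column names, and bucket paths, is a made-up illustration of that idea, not a prescribed schema.)

    ```
    entity:sample_id	bam	bam_index
    sample01	gs://preprocessing-workspace-bucket/sample01.bam	gs://preprocessing-workspace-bucket/sample01.bai
    ```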

    3. I'm sorry, I have no idea where the TCGA controlled access resources live (my access credentials have expired and I keep forgetting to reapply). I recommend asking that sort of question separately rather than as part of a big discussion so that our frontline support team members can recognize it needs to be dispatched to someone else.

    4. Yes sorry, meant human bandwidth.

    Aside from this I hope FireCloud is working out for you; it's very much still a work in progress, so let me know if you have any feedback on what improvements would have the most potential impact on usability and efficiency.

  • MattB Newcastle Member

    Hi @Geraldine_VdAuwera, thanks for your suggestions re consolidating workspaces. Yes, I will repost the TCGA panel question in Ask the Team, and I will definitely be in contact re feedback; so far I've just been testing a few cases, getting a feel for the system and its conventions.

  • JohnyWalter Member
    edited July 16

    Lots of useful information, thanks a lot.
    This is really a good set of best practice workflows. Are there any new GitHub updates? Is there a special workshop?
    Thanks
