GISTIC2.0

jneff, Boston (Member, Broadie, Moderator, admin)
edited September 2016 in Cancer Genome Analysis

Overview

The GISTIC2.0 module identifies regions of the genome that are significantly amplified or deleted across a set of samples. Each aberration is assigned a G-score that considers the amplitude of the aberration as well as the frequency of its occurrence across samples. False Discovery Rate q-values are then calculated for the aberrant regions, and regions with q-values below a user-defined threshold are considered significant.
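The last step above, turning per-region p-values into False Discovery Rate q-values and keeping regions below a threshold, can be sketched as follows. This is a minimal illustration that assumes a Benjamini-Hochberg correction; it does not reproduce how GISTIC derives p-values from G-scores, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as p_values."""
    n = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values


def significant_regions(p_values, qv_thresh=0.1):
    """Indices of regions whose q-value falls below the threshold."""
    q_values = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(q_values) if q < qv_thresh]
```

For example, `significant_regions([0.01, 0.04, 0.03, 0.5])` keeps the first three regions at the default threshold of 0.1.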

For each significant region, a “peak region” is identified, which is the part of the aberrant region with greatest amplitude and frequency of alteration. In addition, a “wide peak” is determined using a leave-one-out algorithm to allow for errors in the boundaries in a single sample. The “wide peak” boundaries are more robust for identifying the most likely gene targets in the region.

Each significantly aberrant region is also tested to determine whether it results primarily from broad events (longer than half a chromosome arm), focal events, or significant levels of both. The GISTIC module reports the genomic locations and calculated q-values for the aberrant regions. It identifies the samples that exhibit each significant amplification or deletion, and it lists genes found in each “wide peak” region.

Runtime

The total expected runtime for GISTIC2.0 depends on the size of the pair set you select for analysis. The TCGA BRCA pair set runs on 2263 tumor/normal samples and the expected runtime is roughly 2 hours. The TCGA ACC pair set runs on 184 tumor/normal samples and the expected runtime is roughly 30 minutes.

Input Parameters

This section describes the input parameters that appear as Method Config Inputs before you launch an analysis. For more information on the file specifications, including column requirements, see the Input Files section below.

seg_file: A six-column tab-delimited file containing the segmented data for all tumor/normal pairs in the pair set. The seg_files for this workspace reside in a Google bucket. There are separate seg_files for ACC and BRCA. These links also appear as attributes for the ACC and BRCA pair sets in the Workspace Data tab.

markers_file: A three-column tab-delimited file identifying the names and positions of all markers. The markers_file for this workspace resides in a Google bucket. You can also find this link in Workspace Attributes, viewable through the Workspace Summary tab.

refgene_file: Contains information about the location of genes and cytobands on a given build of the genome. These files are created in MATLAB. The refgene_file for this workspace resides in a Google bucket. You can also find this link in Workspace Attributes, viewable through the Workspace Summary tab.

cnv_files: There are two options for the file specifying germline CNVs to be excluded from the analysis. The first option allows CNVs to be identified by marker name and is platform-specific. The second option allows the CNVs to be identified by genomic location, which is platform independent but genome-build dependent.

The cnv_file for this workspace resides in a Google bucket. You can also find this link in Workspace Attributes, viewable through the Workspace Summary tab.

Option #1: A two-column, tab-delimited file with an optional header row. The marker names given in this file must match the marker names given in the markers file. The CNV identifiers are for user use and can be arbitrary. The column headers are: (1) Marker Name and (2) CNV Identifier

Option #2: A six-column, tab-delimited file with an optional header row. The ‘CNV Identifier’ is for user use and can be arbitrary. ‘Narrow Region Start’ and ‘Narrow Region End’ are also not used. The column headers are:

  • (1) CNV Identifier

  • (2) Chromosome

  • (3) Narrow Region Start

  • (4) Narrow Region End

  • (5) Wide Region Start

  • (6) Wide Region End

amp_thresh: Threshold for copy number amplifications. Regions with a log2 ratio above this value are considered amplified. (Recommended: 0.1)

del_thresh: Threshold for copy number deletions. Regions with a log2 ratio below the negative of this value are considered deletions. (Recommended: 0.1)

qv_thresh: Significance threshold for q-values. Regions with q-values below this number are considered significant. (Recommended: 0.1)

cap: Minimum and maximum cap values on analyzed data. Regions with a log2 ratio greater than the cap are set to the cap value; regions with a log2 ratio less than -cap are set to -cap. (Default: 1.5)
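As a rough illustration of how amp_thresh, del_thresh, and cap act on a single log2 ratio, the sketch below applies the rules as worded in the parameter descriptions. The function names are hypothetical and this is not GISTIC's own code.

```python
def apply_cap(log2_ratio, cap=1.5):
    """Clamp a log2 ratio to the range [-cap, cap]."""
    return max(-cap, min(cap, log2_ratio))


def classify(log2_ratio, amp_thresh=0.1, del_thresh=0.1):
    """Label a log2 ratio as amplified, deleted, or neutral."""
    if log2_ratio > amp_thresh:
        return "amplified"
    # Deletions are values below the negative of del_thresh.
    if log2_ratio < -del_thresh:
        return "deleted"
    return "neutral"
```

With the recommended values, a segment at 2.3 is capped to 1.5 and called amplified, while a segment at 0.05 is left unchanged and called neutral.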

broad_length_cutoff: Threshold used to distinguish broad from focal events, given in units of fraction of chromosome arm. (Recommended: 0.7)
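A minimal sketch of the broad/focal distinction, assuming an event counts as broad when its length exceeds the cutoff fraction of the chromosome arm. The function name and the strict comparison are assumptions made for illustration.

```python
def classify_event(event_length_bp, arm_length_bp, broad_length_cutoff=0.7):
    """Return "broad" or "focal" based on the fraction of the arm covered."""
    if arm_length_bp <= 0:
        raise ValueError("arm_length_bp must be positive")
    fraction_of_arm = event_length_bp / arm_length_bp
    # Assumption: events longer than the cutoff fraction are broad.
    return "broad" if fraction_of_arm > broad_length_cutoff else "focal"
```

For an arm of 100 Mb, an 80 Mb event would be labeled broad and a 5 Mb event focal.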

remove_X: 0/1 flag indicating whether to remove data from the X chromosome before analysis. (Recommended: 0)

conf_level: Confidence level used to calculate region containing the driver. (Recommended: 0.99)

join_segment_size: Smallest number of markers to allow in segments from the segmented data. Segments containing this many markers or fewer are joined to the adjacent segment that is closest in copy number. (Recommended: 4)
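The joining rule can be sketched for one sample on one chromosome as shown below. This is an illustration only: the Segment type is hypothetical, and giving the merged segment the neighbor's copy number (rather than, say, a marker-weighted average) is an assumption, as is breaking ties in favor of the left neighbor.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: int
    end: int
    num_markers: int
    seg_cn: float


def join_small_segments(segments, join_segment_size=4):
    """Merge short segments into the adjacent segment closest in copy number."""
    segs = list(segments)
    i = 0
    while len(segs) > 1 and i < len(segs):
        seg = segs[i]
        if seg.num_markers > join_segment_size:
            i += 1
            continue
        # Candidate neighbors: the segment on the left and/or right.
        neighbors = [k for k in (i - 1, i + 1) if 0 <= k < len(segs)]
        j = min(neighbors, key=lambda k: abs(segs[k].seg_cn - seg.seg_cn))
        lo, hi = min(i, j), max(i, j)
        merged = Segment(
            segs[lo].start,
            segs[hi].end,
            segs[lo].num_markers + segs[hi].num_markers,
            segs[j].seg_cn,  # assumption: keep the neighbor's copy number
        )
        segs[lo:hi + 1] = [merged]
        i = lo  # re-check the merged segment in case it is still short
    return segs
```

A 3-marker segment at 0.9 sitting between neighbors at 0.0 and 1.0 would be absorbed by the neighbor at 1.0.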

arm_peel: 0/1 flag indicating whether to perform arm-level peel off, which helps separate peaks and clean up noise. (Recommended: 1)

max_sample_segs: Maximum number of segments allowed for a sample in the input data. Samples with more segments than this are excluded from the analysis. (Recommended: 2000)
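The exclusion rule can be approximated by counting segments per sample before the run. The sketch below assumes rows shaped like the six segmentation-file columns, with the sample name first; the function name is hypothetical.

```python
from collections import Counter


def filter_samples(rows, max_sample_segs=2000):
    """Split rows into those from kept samples, plus the excluded sample names.

    Each row is a sequence whose first element is the sample name.
    """
    rows = list(rows)
    counts = Counter(row[0] for row in rows)
    kept = {sample for sample, n in counts.items() if n <= max_sample_segs}
    excluded = sorted(set(counts) - kept)
    return [row for row in rows if row[0] in kept], excluded
```

Counting segments this way is a quick check when the outputs contain fewer samples than the input segmentation file.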

do_gene_gistic: 0/1 flag indicating that the gene GISTIC algorithm should be used to calculate significance of deletions at the gene level instead of at the marker level. (Recommended: 1)

gene_collapse_method: Method for reducing marker-level copy number data to the gene-level copy number data in the gene tables. Markers contained in the gene are used when available, otherwise the flanking marker or markers are used. Allowed values are mean, median, min, max or extreme. The extreme method chooses whichever of min or max is furthest from diploid. (Recommended: extreme)
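The collapse methods can be sketched as below, assuming the marker values are log2 ratios so that diploid corresponds to 0. The function name is hypothetical, and returning the maximum when min and max are equally far from 0 is an arbitrary choice.

```python
from statistics import mean, median


def collapse_markers(values, method="extreme"):
    """Reduce the marker-level values for one gene to a single value."""
    if not values:
        raise ValueError("at least one marker value is required")
    if method == "mean":
        return mean(values)
    if method == "median":
        return median(values)
    if method == "min":
        return min(values)
    if method == "max":
        return max(values)
    if method == "extreme":
        lo, hi = min(values), max(values)
        # Pick whichever of min or max lies furthest from diploid (0).
        return hi if abs(hi) >= abs(lo) else lo
    raise ValueError(f"unknown method: {method}")
```

For markers at -0.8, 0.1, and 0.3, the extreme method returns -0.8, because the minimum is further from 0 than the maximum.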

memoryGB: Integer value specifying the minimum memory requirements (in GB) for the virtual machine running the GISTIC2 task.

preemptible: Integer value specifying the maximum number of times Cromwell should request a preemptible machine for this task before defaulting back to a non-preemptible one.

Input Files

Segmentation File

The segmentation file contains the segmented data for all the samples identified by GLAD, CBS, or some other segmentation algorithm. (See GLAD file format in the GenePattern file formats documentation.) It is a six-column, tab-delimited file with an optional first line identifying the columns. Positions are in base pair units.

The column headers are:

  • (1) Sample (sample name)

  • (2) Chromosome (chromosome number)

  • (3) Start Position (segment start position, in bases)

  • (4) End Position (segment end position, in bases)

  • (5) Num markers (number of markers in segment)

  • (6) Seg.CN (log2(copy number) - 1)
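A minimal reader for a file with these six columns might look like the sketch below. It reads the Seg.CN description as log2(copy number) - 1, treats the first line as a header only if its numeric columns fail to parse, and assumes integer positions. The names SegRow and read_seg_file are hypothetical, and the example rows are invented.

```python
import csv
import io
import math
from typing import NamedTuple


class SegRow(NamedTuple):
    sample: str
    chromosome: str
    start: int
    end: int
    num_markers: int
    seg_cn: float


def seg_cn_from_copy_number(copy_number):
    """Convert an absolute copy number to log2(copy number) - 1."""
    return math.log2(copy_number) - 1


def read_seg_file(handle):
    """Parse tab-delimited segment rows, skipping an optional header line."""
    segments = []
    for index, row in enumerate(csv.reader(handle, delimiter="\t")):
        if not row:
            continue
        if len(row) != 6:
            raise ValueError(f"line {index + 1}: expected 6 columns, got {len(row)}")
        try:
            parsed = SegRow(row[0], row[1], int(row[2]), int(row[3]),
                            int(row[4]), float(row[5]))
        except ValueError:
            if index == 0:
                continue  # first line is a header
            raise
        segments.append(parsed)
    return segments


# Invented example data, for illustration only.
example = io.StringIO(
    "Sample\tChromosome\tStart Position\tEnd Position\tNum markers\tSeg.CN\n"
    "S1\t1\t100\t2000\t25\t0.58\n"
    "S1\t1\t2001\t9000\t40\t-1.0\n"
)
segments = read_seg_file(example)
```

Under this reading, two copies map to a Seg.CN of 0, four copies to 1, and one copy to -1.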

Example Segmentation File

Markers File

The markers file identifies the marker names and positions of the markers in the original dataset (before segmentation). It is a three column, tab-delimited file with an optional header. The column headers are:

  • (1) Marker Name

  • (2) Chromosome

  • (3) Marker Position (in bases)

Example Markers File

Reference Genome File (-refgene) REQUIRED

The reference genome file contains information about the location of genes and cytobands on a given build of the genome. Reference genome files are created in MATLAB and are not viewable with a text editor. The GISTIC 2.0 release has four reference genomes located in the refgenefiles directory: hg16.mat, hg17.mat, hg18.mat, and hg19.mat.

CNV File

There are two options for the file specifying germline CNVs to be excluded from the analysis. The first option allows CNVs to be identified by marker name and is platform-specific. The second option allows the CNVs to be identified by genomic location, which is platform independent but genome-build dependent.

Option #1: A two-column, tab-delimited file with an optional header row. The marker names given in this file must match the marker names given in the markers file. The CNV identifiers are for user use and can be arbitrary. The column headers are: (1) Marker Name and (2) CNV Identifier

Option #2: A six-column, tab-delimited file with an optional header row. The ‘CNV Identifier’ is for user use and can be arbitrary. ‘Narrow Region Start’ and ‘Narrow Region End’ are also not used. The column headers are:

  • (1) CNV Identifier
  • (2) Chromosome
  • (3) Narrow Region Start
  • (4) Narrow Region End
  • (5) Wide Region Start
  • (6) Wide Region End
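The two CNV file options can be told apart by column count. The sketch below is one way to read either layout, keeping only the fields described as used: marker names for Option #1, and chromosome plus wide region bounds for Option #2. The function name is hypothetical, header detection is a heuristic assumption, and the example values are invented.

```python
import csv
import io


def read_cnv_file(handle):
    """Return ("marker", names) for Option #1 or ("location", regions) for Option #2."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        raise ValueError("empty CNV file")
    width = len(rows[0])
    if width == 2:
        # Assumption: a header row, if present, starts with "Marker Name".
        has_header = rows[0][0].strip().lower() == "marker name"
        data = rows[1:] if has_header else rows
        return "marker", [row[0] for row in data]
    if width == 6:
        regions = []
        for index, row in enumerate(rows):
            try:
                # Columns 5 and 6 hold the wide region start and end.
                regions.append((row[1], int(row[4]), int(row[5])))
            except ValueError:
                if index == 0:
                    continue  # first line is a header
                raise
        return "location", regions
    raise ValueError(f"expected 2 or 6 columns, got {width}")


# Invented example data, for illustration only.
by_marker = read_cnv_file(io.StringIO(
    "Marker Name\tCNV Identifier\nSNP_1\tcnv1\nSNP_2\tcnv1\n"
))
by_location = read_cnv_file(io.StringIO("cnv1\t1\t100\t200\t50\t300\n"))
```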

Example CNV File

Outputs

GISTIC Runtime Info

E.g., runtime_info.txt. Runtime stats for the recent GISTIC run, including number of processors and memory.

Amplification Genes File

E.g., amp_genes.txt. The amp genes file contains one column for each amplification peak identified in the GISTIC analysis. The first four rows are:

(1) cytoband

(2) q-value

(3) residual q-value

(4) wide peak boundaries

These rows identify the lesion in the same way as the all lesions file. The remaining rows list the genes contained in each wide peak. For peaks that contain no genes, the nearest gene is listed in brackets.
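Because each peak is a column, reading this file usually means transposing it. The sketch below assumes a tab-delimited file whose first column holds row labels, which is not stated above, and uses hypothetical function names and invented example values.

```python
import csv
import io


def cell(row, col):
    """Return the stripped cell at col, or "" if the row is too short."""
    return row[col].strip() if col < len(row) else ""


def read_peak_genes(handle):
    """Return one dict per peak column, with its descriptors and gene list."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if len(rows) < 4:
        raise ValueError("expected at least four descriptor rows")
    descriptor_rows, gene_rows = rows[:4], rows[4:]
    peaks = []
    # Column 0 is assumed to hold row labels, so peaks start at column 1.
    for col in range(1, len(descriptor_rows[0])):
        if not cell(descriptor_rows[0], col):
            continue  # skip empty trailing columns
        peaks.append({
            "cytoband": cell(descriptor_rows[0], col),
            "q_value": float(cell(descriptor_rows[1], col)),
            "residual_q_value": float(cell(descriptor_rows[2], col)),
            "wide_peak_boundaries": cell(descriptor_rows[3], col),
            "genes": [cell(r, col) for r in gene_rows if cell(r, col)],
        })
    return peaks


# Invented example data, for illustration only.
peaks = read_peak_genes(io.StringIO(
    "cytoband\t1p36.1\t2q11.2\n"
    "q value\t1e-20\t0.002\n"
    "residual q value\t1e-18\t0.01\n"
    "wide peak boundaries\tchr1:100-200\tchr2:500-900\n"
    "genes in wide peak\tGENE_A\t[GENE_C]\n"
    "\tGENE_B\t\n"
))
```

A bracketed entry such as [GENE_C] is kept as-is, so callers can tell a nearest gene from a gene inside the wide peak.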

Deletion Genes File

E.g., del_genes.txt. The del genes file contains one column for each deletion peak identified in the GISTIC analysis. The file format for the del genes file is identical to the format for the amp genes file.

Amplification Score/q-value GISTIC plot

E.g., amp_qplot.png. The amplification .png is a plot of the G-scores (top) and q-values (bottom) with respect to amplifications for all markers over the entire region analyzed.

Deletion Score/q-value GISTIC plot

E.g., del_qplot.png. The deletion .png is a plot of the G-scores (top) and q-values (bottom) with respect to deletions for all markers over the entire region analyzed.

Segmented Copy Number

E.g., raw_copy_number.png. A .png file containing a heatmap image of the genomic profiles of the segmented input copy number data. The genome is represented along the vertical axis and samples are arranged horizontally.

GISTIC Version

E.g., gistic_version.txt. The version of GISTIC used in the most recent run. (Current version is 2.0.22.)

GISTIC Array List File

E.g., arraylistfile.txt. The Array List File is a one column file that identifies all samples that were used in the analysis.

Broad Significance Results

E.g., broad_significance_results.txt. A summary of significant arm-level results, including Arm, # of genes, Amp frequency, Amp Z score, Amp Q value, Del frequency, Del Z score, and Del Q value.

GISTIC Inputs File

E.g., gisticinputs.txt. A summary of the inputs and runtime parameters for the recent GISTIC run.

GISTIC Nozzle Report

E.g., call-report_gistic2/nozzle.html. An HTML summary of GISTIC results, including significant arm-level results, significant focal amplifications, and significant focal deletions.

How to run GISTIC2.0 in FireCloud

1. Navigate to the workspace broad-firecloud-tutorials/Broad_GISTIC2_Workflow_BestPractice.

2. Go to the Method Configurations tab and select Gistic2_v1-0_BETA_cfg.

3. Click Launch Analysis.

4. Sort to pair_set. Then select a pair set (e.g., ACC or BRCA) and click Launch.

5. In the Monitor tab, view the status of your analysis. Initially, the status displays Submitted.

6. When the status displays Done, click on the name of the pair set (e.g., ACC) to view results.

7. Click on Outputs: Show, then select output files to view the results of this analysis.

8. You can also view results, including Nozzle reports and graphical analysis, by viewing attributes in the Data tab.

References

  • Beroukhim R, Getz G, et al. (2007). "Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma." Proc Natl Acad Sci, 104:20007-20012.

  • Beroukhim R, Mermel C, et al. (2010). "The landscape of somatic copy-number alteration across human cancers." Nature, 463:899-905.

  • GISTIC2 GenePattern Documentation. Note: you must register and log in to GenePattern to view.

  • McCarroll SA, et al. (2008). "Integrated detection and population-genetic analysis of SNPs and copy number variation." Nat Genet, 40(10):1166-1174.

  • Mermel C, Schumacher S, et al. (2011). "GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers." Genome Biology, 12:R41.

  • The Sanger Institute: Cancer Gene Census.


Comments

  • yaoqianlan, Shanghai, China (Member)

    Hi

    When I used GISTIC (2.0.23), I ran into something strange. The input segment file includes 55 samples, but the GISTIC output files (such as "all_lesions.conf_90") contain only 18 samples (see the attached files). Where could the problem be?

    Looking forward to your reply.
    
    Thank you.
    