
# SuperArray

edited September 2012

### 1. Introduction

The SuperArray annotator is invoked through the
SVVariantAnnotator walker, which defines arguments
common to all annotators.

The SuperArray annotator uses array intensity data to do a form of *in
silico* validation of copy number variants.

This annotator requires that each variant indicate (through an INFO tag) the
set of samples that are thought to carry the variant (either as homozygotes or
heterozygotes). A Wilcoxon rank sum test is performed and a p-value is
calculated as follows: For each array probe underlying the variant, each sample
is assigned an integral rank for that probe. The ranks from all probes are then
pooled and treated as a single set of observations for the Wilcoxon rank sum
test. If there is more than one probe, there will certainly be ties (for
example, some sample will have rank 1 for each probe). Ties are broken randomly
to assign the final ranking.

The Wilcoxon rank sum test then checks whether the event-carrying samples (as
indicated by the INFO tag) are shifted with respect to the non-event-carrying
samples (for deletions, this is a one-tailed test for a negative shift).
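The procedure above can be sketched in a few lines of Python. This is only an illustration of the description, not the annotator's own code: the function names are hypothetical, the sketch assumes that lower intensity receives the lower rank, and it swaps in a normal approximation for the p-value, whereas the annotator uses an exact test in many cases.

```python
import math
import random

def probe_ranks(intensities):
    """Rank samples 1..n for one probe (lowest intensity gets rank 1)."""
    order = sorted(intensities, key=intensities.get)
    return {sample: rank for rank, sample in enumerate(order, start=1)}

def pooled_rank_sum_p(probes, carriers, rng=None):
    """One-tailed p-value for carriers being shifted towards lower ranks.

    probes   -- list of {sample_id: intensity} dicts, one per probe
    carriers -- set of sample IDs thought to carry the variant
    """
    rng = rng or random.Random()
    # One observation per (probe, sample): the sample's rank within that probe.
    observations = []
    for intensities in probes:
        for sample, rank in probe_ranks(intensities).items():
            observations.append((rank, rng.random(), sample in carriers))
    # Sorting on (rank, random key) breaks ties between probes randomly.
    observations.sort(key=lambda obs: (obs[0], obs[1]))

    n1 = sum(1 for obs in observations if obs[2])   # carrier observations
    n2 = len(observations) - n1                     # non-carrier observations
    if n1 == 0 or n2 == 0:
        raise ValueError("need both carrier and non-carrier samples")

    # Rank sum of the carrier observations in the final pooled ranking.
    w = sum(pos for pos, obs in enumerate(observations, start=1) if obs[2])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    # Lower-tail normal probability P(Z <= z): small when carriers rank low.
    return 0.5 * math.erfc(-z / math.sqrt(2))
```

With a deletion, carriers are expected to have lower intensities, so a small value returned by `pooled_rank_sum_p` corresponds to the negative shift described above.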

### 2. Inputs / Arguments

• -arrayIntensityFile <data-file> : The path to an input file containing a
matrix of array intensity values. The file must be tab delimited with a
header line. Each line of the file contains data for one probe. The first four
columns should be named ID, CHR, START and END. ID is an identifier for the
probe and the other three columns give the 1-based coordinates of the probe
(START and END can be equal). Columns beyond the first four provide intensity
data for each sample. The header for each additional column should contain the
sample ID.

• -superArraySampleTag <tag-name> : The name of the INFO field that contains
the comma-separated list of carrier samples. The default tag name is SASAMPLES.
For example, SASAMPLES=NA12878,NA12891,NA12892 would indicate that three
samples carry the variant.

• -sample <sample-or-sample-list> : A subset of samples on which to perform the test (or a .list file of sample identifiers). The default behavior is to use all samples in the array intensity file.

• -superArrayPermute <true/false> : If set to true, then the sample identities are permuted before performing the test to generate a null distribution.
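
To make the two input formats concrete, here is a minimal Python sketch that reads an intensity matrix laid out as described above and extracts the carrier list from an INFO string. The function names are hypothetical and the intensity values in the demo are made up for illustration; this is not how the annotator itself parses its inputs.

```python
import csv
import io

def read_intensity_matrix(handle):
    """Parse a tab-delimited intensity matrix with a header line.

    Returns (sample_ids, probes), where each probe is a dict holding its
    ID, coordinates and a {sample_id: intensity} mapping.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[:4] != ["ID", "CHR", "START", "END"]:
        raise ValueError("first four columns must be ID, CHR, START, END")
    samples = header[4:]
    probes = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        probes.append({
            "id": row[0],
            "chr": row[1],
            "start": int(row[2]),
            "end": int(row[3]),
            "intensity": dict(zip(samples, map(float, row[4:]))),
        })
    return samples, probes

def carrier_samples(info, tag="SASAMPLES"):
    """Return the comma-separated sample list stored under `tag` in an INFO string."""
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == tag and value:
            return value.split(",")
    return []

# Tiny made-up example: one probe, two samples.
demo = io.StringIO(
    "ID\tCHR\tSTART\tEND\tNA12878\tNA12891\n"
    "probe1\t1\t1000\t1000\t0.42\t1.03\n"
)
samples, probes = read_intensity_matrix(demo)
print(samples)
print(carrier_samples("SVTYPE=DEL;SASAMPLES=NA12878,NA12891,NA12892"))
```

If a different tag is passed with -superArraySampleTag, the same name would be given as the `tag` argument here.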

### 3. Annotations

This annotator produces three INFO field annotations for each VCF record:

• SANSAMPLES : The number of carrier samples.

• SANPROBES : The number of probes underlying the event.

• SAPVALUE : The calculated p-value.

The annotator can also generate a tab-delimited report file containing these
annotations.
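
As a sketch of how these annotations might be consumed downstream, the Python snippet below pulls the three fields out of a VCF INFO string. The function name is hypothetical, the example record is made up, and the snippet assumes the values are written as plain integers and a plain decimal number, which the text above does not confirm.

```python
def superarray_annotations(info):
    """Extract SANSAMPLES, SANPROBES and SAPVALUE from an INFO string.

    Returns None when any of the three annotations is absent.
    """
    fields = dict(f.split("=", 1) for f in info.split(";") if "=" in f)
    wanted = ("SANSAMPLES", "SANPROBES", "SAPVALUE")
    if any(key not in fields for key in wanted):
        return None
    return {
        "n_samples": int(fields["SANSAMPLES"]),   # number of carrier samples
        "n_probes": int(fields["SANPROBES"]),     # probes underlying the event
        "p_value": float(fields["SAPVALUE"]),     # rank sum test p-value
    }

# Made-up INFO string for illustration only.
print(superarray_annotations("SVTYPE=DEL;SANSAMPLES=3;SANPROBES=12;SAPVALUE=0.0004"))
```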

### 4. Example

The SuperArray annotator requires Genome STRiP and R.

```
export SV_DIR=/path/to/SVToolkit/root/directory

java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    -T SVVariantAnnotator \
    -A SuperArray \
    -R /humgen/1kg/reference/human_g1k_v37.fasta \
    -BTI variant \
    -B:variant,VCF input.vcf \
    -O output.vcf \
    -arrayIntensityFile Omni25_superarray_intensity_matrix.dat \
    -sample discovery_samples.list \
    -superArraySampleTag SAMPLES \
    -writeReport \
    -reportFile superarray_output.dat
```

### 5. Performance

The SuperArray annotator uses R as well as Java and can consume up to 10 GB of
memory.

The SuperArray annotator uses an exact test in many cases and this can be
expensive. If you want to test more than a few hundred variants, you should
consider splitting the input VCF file into smaller files and processing them in
parallel. In sample runs, testing 1000 variants in 1000 samples can take 2 to 3
hours.
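
One way to do the splitting is sketched below in Python. The helper names, the chunk size and the output file naming are all hypothetical, and the sketch assumes an uncompressed VCF in which all header lines (those starting with `#`) come before the records. Each chunk repeats the full header so that it can be annotated as a standalone VCF.

```python
def split_vcf(lines, records_per_chunk):
    """Yield lists of lines, each holding the full header plus a slice of records."""
    header = []
    chunk = []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        chunk.append(line)
        if len(chunk) == records_per_chunk:
            yield header + chunk
            chunk = []
    if chunk:
        # Final, possibly shorter, chunk.
        yield header + chunk

def write_chunks(vcf_path, records_per_chunk, prefix):
    """Write each chunk to <prefix>.NNNN.vcf and return the paths created."""
    paths = []
    with open(vcf_path) as handle:
        chunks = split_vcf(handle, records_per_chunk)
        for index, chunk in enumerate(chunks, start=1):
            path = f"{prefix}.{index:04d}.vcf"
            with open(path, "w") as out:
                out.writelines(chunk)
            paths.append(path)
    return paths
```

For example, `write_chunks("input.vcf", 200, "input.part")` would produce files of at most 200 records each, and each file could then be passed to a separate SVVariantAnnotator run.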

Post edited by Geraldine_VdAuwera on