SuperArray

Geraldine_VdAuwera (Administrator, GATK Dev)
edited September 2012 in GenomeSTRiP Documentation

1. Introduction

The SuperArray annotator is invoked through the
SVVariantAnnotator walker, which defines arguments
common to all annotators.

The SuperArray annotator uses array intensity data to perform a form of in
silico validation of copy number variants.

This annotator requires that each variant indicate (through an INFO tag) the
set of samples that are thought to carry the variant, either as homozygotes or
as heterozygotes. A p-value is calculated with a Wilcoxon rank sum test as
follows. For each array probe underlying the variant, each sample is assigned
an integer rank for that probe. The ranks from all probes are then pooled and
treated as a single set of observations for the Wilcoxon rank sum test. If
there is more than one probe, the pooled set is certain to contain ties (for
example, some sample will be rank 1 with respect to each probe). Ties are
broken randomly to assign the final ranking.

A Wilcoxon rank sum test is then used to test whether the event-carrying
samples (as indicated by the INFO tag) are shifted with respect to the
non-event-carrying samples. For deletions, this is a one-tailed test for a
negative shift.
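
The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the annotator's actual code: the function name and input layout are hypothetical, rank 1 is assumed to mean the lowest intensity, and the p-value uses a normal approximation rather than the exact test the annotator applies in many cases.

```python
import math
import random


def pooled_rank_sum(intensities, carriers, rng=None):
    """Illustrative pooled rank sum test for a deletion (negative shift).

    intensities: dict mapping sample ID -> list of per-probe intensities,
                 with the probes in the same order for every sample.
    carriers:    set of sample IDs thought to carry the variant.
    Returns (W, p): the carrier rank sum and a one-tailed p-value from a
    normal approximation (an assumption of this sketch).
    """
    rng = rng or random.Random(0)
    samples = sorted(intensities)
    n_probes = len(intensities[samples[0]])

    # Step 1: rank the samples separately for each probe (1 = lowest).
    observations = []  # (per-probe rank, is_carrier)
    for p in range(n_probes):
        ordered = sorted(samples, key=lambda s: intensities[s][p])
        for rank, sample in enumerate(ordered, start=1):
            observations.append((rank, sample in carriers))

    # Step 2: pool the per-probe ranks and break ties randomly by
    # shuffling first and then applying a stable sort on the rank.
    rng.shuffle(observations)
    observations.sort(key=lambda obs: obs[0])

    # Step 3: rank sum of the carrier observations in the final ranking.
    n1 = sum(1 for _, is_carrier in observations if is_carrier)
    n2 = len(observations) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need both carrier and non-carrier observations")
    w = sum(pos for pos, (_, is_carrier) in enumerate(observations, start=1)
            if is_carrier)

    # Step 4: one-tailed p-value for a negative shift, P(Z <= z).
    mean = n1 * (n1 + n2 + 1) / 2.0
    var = n1 * n2 * (n1 + n2 + 1) / 12.0
    z = (w - mean) / math.sqrt(var)
    p_value = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return w, p_value
```

With two probes and six samples, two carriers that have the lowest intensity at both probes occupy the first four positions of the final ranking, so the carrier rank sum is 10 and the approximate p-value is small.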

2. Inputs / Arguments

  • -arrayIntensityFile <data-file> : The path to an input file containing a
    matrix of array intensity values. The file must be tab-delimited with a
    header line. Each subsequent line contains data for one probe. The first four
    columns should be named ID, CHR, START and END. ID is an identifier for the
    probe and the other three columns give the 1-based coordinates of the probe
    (START and END can be equal). Columns beyond the first four provide intensity
    data for each sample. The header for each additional column should contain the
    sample ID.

  • -superArraySampleTag <tag-name> : The name of the INFO field that contains
    the comma-separated list of carrier samples. The default tag name is
    SASAMPLES. For example, SASAMPLES=NA12878,NA12891,NA12892 would indicate
    that three samples carry the variant.

  • -sample <sample-or-sample-list> : A subset of samples on which to perform the test (or a .list file of sample identifiers). The default behavior is to use all samples in the array intensity file.

  • -superArrayPermute <true/false> : If set to true, then the sample identities are permuted before performing the test to generate a null distribution.
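
The two input conventions above (the intensity matrix layout and the carrier tag) can be checked with a short Python sketch before running the annotator. The function names and the returned data layout are hypothetical, and the sketch assumes every intensity cell is numeric; the annotator's own parser may behave differently.

```python
import csv
import io

REQUIRED_COLUMNS = ["ID", "CHR", "START", "END"]


def read_intensity_matrix(handle):
    """Read a tab-delimited intensity matrix with one probe per line.

    Returns (samples, probes), where samples is the list of sample IDs taken
    from the header and probes is a list of dicts, one per probe.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[:4] != REQUIRED_COLUMNS:
        raise ValueError("first four columns must be ID, CHR, START, END")
    samples = header[4:]
    probes = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        probes.append({
            "id": row[0],
            "chr": row[1],
            "start": int(row[2]),  # 1-based; START may equal END
            "end": int(row[3]),
            "intensity": dict(zip(samples, (float(v) for v in row[4:]))),
        })
    return samples, probes


def carriers_from_info(info, tag="SASAMPLES"):
    """Extract the comma-separated carrier list from a VCF INFO string."""
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        if key == tag and sep:
            return value.split(",")
    return []


# Small in-memory example (the probe ID and values are made up).
example = io.StringIO(
    "ID\tCHR\tSTART\tEND\tNA12878\tNA12891\n"
    "probe1\t1\t1000\t1000\t0.25\t1.5\n"
)
samples, probes = read_intensity_matrix(example)
carriers = carriers_from_info("SASAMPLES=NA12878,NA12891,NA12892")
```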

3. Annotations

This annotator produces three INFO field annotations for each VCF record:

  • SANSAMPLES : The number of carrier samples.

  • SANPROBES : The number of probes underlying the event.

  • SAPVALUE : The calculated p-value.

The annotator can also generate a tab-delimited report file containing these
annotations.
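
For downstream filtering, the three annotations can be pulled out of the INFO column of the annotated VCF. The following Python sketch assumes the annotations appear as ordinary KEY=VALUE entries in a semicolon-separated INFO string; the function name and the example INFO string are hypothetical.

```python
def superarray_annotations(info):
    """Return the SuperArray annotations found in a VCF INFO string.

    Annotations that are absent are returned as None.
    """
    fields = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if sep:
            fields[key] = value
    return {
        "SANSAMPLES": int(fields["SANSAMPLES"]) if "SANSAMPLES" in fields else None,
        "SANPROBES": int(fields["SANPROBES"]) if "SANPROBES" in fields else None,
        "SAPVALUE": float(fields["SAPVALUE"]) if "SAPVALUE" in fields else None,
    }


# Made-up INFO string for illustration only.
result = superarray_annotations("SVTYPE=DEL;SANSAMPLES=3;SANPROBES=12;SAPVALUE=1.0E-4")
```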

4. Example

The SuperArray annotator requires Genome STRiP and R.

export SV_DIR=/path/to/SVToolkit/root/directory

java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.gatk.CommandLineGATK \
    -T SVVariantAnnotator \
    -A SuperArray \
    -R /humgen/1kg/reference/human_g1k_v37.fasta \
    -BTI variant \
    -B:variant,VCF input.vcf \
    -O output.vcf \
    -arrayIntensityFile Omni25_superarray_intensity_matrix.dat \
    -sample discovery_samples.list \
    -superArraySampleTag SAMPLES \
    -writeReport \
    -reportFile superarray_output.dat

5. Performance

The SuperArray annotator uses R as well as Java and can consume up to 10 GB of
memory.

The SuperArray annotator uses an exact test in many cases and this can be
expensive. If you want to test more than a few hundred variants, you should
consider splitting the input VCF file into several smaller files and
processing them in parallel. In sample runs, testing 1000 variants in 1000
samples can take 2 to 3 hours.
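
One way to split the input is sketched below in Python. It is a minimal illustration, not part of Genome STRiP: the function name and chunk size are hypothetical, and it assumes an uncompressed VCF in which all header lines (those starting with "#") come before the first record. Each chunk repeats the full header so that it is a valid VCF on its own.

```python
def split_vcf(lines, records_per_chunk):
    """Split VCF lines into chunks that each carry the full header.

    lines: iterable of VCF lines (header lines start with '#').
    Returns a list of chunks, each a list of lines.
    """
    header, chunk, chunks = [], [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        chunk.append(line)
        if len(chunk) == records_per_chunk:
            chunks.append(header + chunk)
            chunk = []
    if chunk:  # remaining records that did not fill a whole chunk
        chunks.append(header + chunk)
    return chunks


# Tiny in-memory example with made-up records.
vcf_lines = ["##fileformat=VCFv4.1\n", "#CHROM\tPOS\tID\n"] + [
    "1\t%d\tsv%d\n" % (1000 * i, i) for i in range(1, 6)
]
chunks = split_vcf(vcf_lines, records_per_chunk=2)
```

Each chunk can then be written to its own file and passed to a separate annotator run in place of input.vcf.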

Geraldine Van der Auwera, PhD