SuperArray

Geraldine_VdAuwera (Administrator, GATK Developer)
edited September 2012 in GenomeSTRiP Documentation

1. Introduction

The SuperArray annotator is invoked through the SVVariantAnnotator walker, which defines arguments common to all annotators.

The SuperArray annotator uses array intensity data to perform a form of in silico validation of copy number variants.

This annotator requires that each variant indicate (through an INFO tag) the set of samples that are thought to carry the variant (either as homozygotes or heterozygotes). A Wilcoxon rank sum test is performed and a p-value is calculated as follows. For each array probe underlying the variant, each sample is assigned an integer rank for that probe. The ranks from all probes are then pooled and treated as a single set of observations for the Wilcoxon rank sum test. If there is more than one probe, there will certainly be ties (i.e. some sample will be rank 1 with respect to each probe). Ties are broken randomly to assign the final ranking.

A Wilcoxon rank sum test is then used to test whether the event-carrying samples (as indicated by the INFO tag) are shifted with respect to the non-event-carrying samples (for deletions, this is a one-tailed test of a negative shift).
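The procedure described above can be sketched in a few lines of Python. This is only an illustration, not the annotator's actual implementation: the function names are hypothetical, ranking from lowest to highest intensity is an assumption, and the p-value uses a normal approximation to the rank sum distribution rather than an exact test.

```python
import math
import random

def per_probe_ranks(intensities):
    """Rank samples 1..n within one probe (assumed: lowest intensity gets rank 1)."""
    order = sorted(intensities, key=intensities.get)
    return {sample: rank for rank, sample in enumerate(order, start=1)}

def pooled_rank_sum_pvalue(probes, carriers, rng=None):
    """One-tailed p-value for a negative shift of carriers (deletion case).

    probes   -- list of {sample_id: intensity} dicts, one per probe
    carriers -- set of sample IDs thought to carry the variant
    """
    if rng is None:
        rng = random.Random(0)
    # Pool (per-probe rank, random key, is_carrier) across all probes.
    pooled = []
    for intensities in probes:
        for sample, rank in per_probe_ranks(intensities).items():
            pooled.append((rank, rng.random(), sample in carriers))
    # Sorting on (rank, random key) breaks ties between probes randomly.
    pooled.sort(key=lambda obs: (obs[0], obs[1]))
    n1 = sum(1 for obs in pooled if obs[2])   # carrier observations
    n2 = len(pooled) - n1                     # non-carrier observations
    if n1 == 0 or n2 == 0:
        raise ValueError("need both carrier and non-carrier observations")
    # Rank sum of the carrier observations in the final ranking.
    w = sum(pos for pos, obs in enumerate(pooled, start=1) if obs[2])
    # Normal approximation (an assumption of this sketch, not an exact test).
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
```

With carriers that consistently have the lowest intensities on every probe, the carrier rank sum is small and the returned p-value is close to zero, which is the signal expected for a true deletion.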

2. Inputs / Arguments

  • -arrayIntensityFile <data-file> : The path to an input file containing a matrix of array intensity values. The file must be tab-delimited with a header line. Each subsequent line contains data for one probe. The first four columns should be named ID, CHR, START and END. ID is an identifier for the probe and the other three columns give the 1-based coordinates of the probe (START and END can be equal). Columns beyond the first four provide intensity data for each sample. The header for each additional column should contain the sample ID.

  • -superArraySampleTag <tag-name> : The name of the INFO field that contains the comma-separated list of carrier samples. The default tag name is SASAMPLES. For example, SASAMPLES=NA12878,NA12891,NA12892 would indicate that three samples carry the variant.

  • -sample <sample-or-sample-list> : A subset of samples on which to perform the test (or a .list file of sample identifiers). The default behavior is to use all samples in the array intensity file.

  • -superArrayPermute <true/false> : If set to true, then the sample identities are permuted before performing the test to generate a null distribution.
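To make the input formats above concrete, here is a small Python sketch that reads an intensity matrix laid out as described and extracts the carrier list from an INFO string. The function names and the example values are hypothetical, and the parsing is an illustration of the file layout rather than the annotator's own reader.

```python
import csv
import io

def read_intensity_matrix(handle):
    """Parse a tab-delimited matrix with ID, CHR, START, END, then one column per sample."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[:4] != ["ID", "CHR", "START", "END"]:
        raise ValueError("first four columns must be ID, CHR, START, END")
    samples = header[4:]  # remaining header fields are sample IDs
    probes = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        probes.append({
            "id": row[0],
            "chr": row[1],
            "start": int(row[2]),  # 1-based coordinates; START may equal END
            "end": int(row[3]),
            "intensities": dict(zip(samples, map(float, row[4:]))),
        })
    return samples, probes

def carriers_from_info(info, tag="SASAMPLES"):
    """Return the comma-separated carrier samples stored under `tag` in a VCF INFO string."""
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        if key == tag and sep:
            return value.split(",")
    return []

# Made-up example data, for illustration only.
example = io.StringIO(
    "ID\tCHR\tSTART\tEND\tNA12878\tNA12891\n"
    "probe1\t1\t1000\t1000\t0.42\t1.03\n"
)
samples, probes = read_intensity_matrix(example)
carriers = carriers_from_info("SVTYPE=DEL;SASAMPLES=NA12878,NA12891,NA12892")
```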

3. Annotations

This annotator produces three INFO field annotations for each VCF record:

  • SANSAMPLES : The number of carrier samples.

  • SANPROBES : The number of probes underlying the event.

  • SAPVALUE : The calculated p-value.

The annotator can also generate a tab-delimited report file containing these annotations.

4. Example

The SuperArray annotator requires Genome STRiP and R.

export SV_DIR=/path/to/SVToolkit/root/directory

java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.gatk.CommandLineGATK \
    -T SVVariantAnnotator \
    -A SuperArray \
    -R /humgen/1kg/reference/human_g1k_v37.fasta \
    -BTI variant \
    -B:variant,VCF input.vcf \
    -O output.vcf \
    -arrayIntensityFile Omni25_superarray_intensity_matrix.dat \
    -sample discovery_samples.list \
    -superArraySampleTag SAMPLES \
    -writeReport \
    -reportFile superarray_output.dat

5. Performance

The SuperArray annotator uses R as well as Java and can consume up to 10 GB of memory.

The SuperArray annotator uses an exact test in many cases, which can be expensive. If you want to test more than a few hundred variants, you should consider splitting the input VCF file into smaller files and processing them in parallel. In sample runs, testing 1000 variants in 1000 samples can take 2 to 3 hours.
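One way to split the input for parallel runs is sketched below in Python. The function name and chunk size are hypothetical, and the sketch assumes an uncompressed VCF whose header lines (those starting with "#") all precede the records. Each chunk repeats the full header so that it is a valid VCF on its own.

```python
def split_vcf(lines, records_per_chunk):
    """Yield lists of lines, each holding the VCF header plus up to N records."""
    header, chunk = [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)  # meta-information and column header lines
            continue
        chunk.append(line)
        if len(chunk) == records_per_chunk:
            yield header + chunk
            chunk = []
    if chunk:
        yield header + chunk  # final, possibly shorter, chunk

def write_chunks(vcf_path, records_per_chunk, prefix="chunk"):
    """Write each chunk to <prefix>_<n>.vcf and return the file names."""
    names = []
    with open(vcf_path) as handle:
        for n, chunk in enumerate(split_vcf(handle, records_per_chunk), start=1):
            name = f"{prefix}_{n}.vcf"
            with open(name, "w") as out:
                out.writelines(chunk)
            names.append(name)
    return names
```

Each resulting file can then be passed to a separate SVVariantAnnotator run, keeping the memory footprint noted above in mind when deciding how many runs to start at once.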

Geraldine Van der Auwera, PhD