The SuperArray annotator is invoked through the
SVVariantAnnotator walker, which defines arguments
common to all annotators.
The SuperArray annotator uses array intensity data to do a form of in silico
validation of copy number variants.
This annotator requires that each variant indicate (through an INFO tag) the set of samples that are thought to carry the variant (either as homozygotes or heterozygotes). A Wilcoxon rank sum test is performed and a p-value calculated
as follows: For each array probe underlying the variant, each sample is
assigned an integral rank for that probe. Then the set of ranks (across all
probes) is combined and treated as a set of observations for the Wilcoxon rank
sum test. If there is more than one probe, there will certainly be ties (i.e.
some sample will be rank 1 with respect to each probe). Ties are broken
randomly to assign the final ranking.
A Wilcoxon rank sum test is then used to test whether the event-carrying
samples (as indicated by the INFO tag) are shifted with respect to the
non-event-carrying samples (for deletions, this is a one-tailed test of a
negative shift).
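The ranking procedure described above can be sketched in a few lines of Python. This is only an illustration of one reading of the text, not the annotator's actual implementation: the function names are hypothetical, each (sample, probe) rank is treated as one observation, and the p-value comes from the normal approximation to the rank sum statistic rather than from an exact test.

```python
import math
import random


def per_probe_ranks(intensities):
    """Rank the samples for one probe: 1 = lowest intensity."""
    ordered = sorted(intensities, key=lambda s: intensities[s])
    return {sample: i + 1 for i, sample in enumerate(ordered)}


def pooled_ranking(probes, rng):
    """Pool the per-probe ranks of all probes and assign final ranks 1..N.

    Observations with equal per-probe rank are ordered by a random key,
    which breaks the ties randomly.
    """
    observations = []
    for probe in probes:
        for sample, rank in per_probe_ranks(probe).items():
            observations.append((rank, rng.random(), sample))
    observations.sort()
    return [(sample, i + 1) for i, (_, _, sample) in enumerate(observations)]


def lower_tail_p(final_ranks, carriers):
    """One-tailed p-value for a negative shift of the carrier observations.

    Assumption of this sketch: normal approximation to the rank sum
    statistic, without continuity correction.
    """
    n1 = sum(1 for sample, _ in final_ranks if sample in carriers)
    n2 = len(final_ranks) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need both carrier and non-carrier observations")
    w = sum(rank for sample, rank in final_ranks if sample in carriers)
    u = w - n1 * (n1 + 1) / 2          # Mann-Whitney U for the carriers
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)


# Toy data (made up): two probes, six samples, s1 and s2 carry a deletion.
probes = [
    {"s1": 0.2, "s2": 0.3, "s3": 1.0, "s4": 1.1, "s5": 0.9, "s6": 1.2},
    {"s1": 0.4, "s2": 0.1, "s3": 0.8, "s4": 1.3, "s5": 1.0, "s6": 0.9},
]
final = pooled_ranking(probes, random.Random(0))
p = lower_tail_p(final, {"s1", "s2"})
```

With carriers that consistently show the lowest intensities, as in the toy data, the lower-tail p-value is small; swapping in carriers with the highest intensities pushes it toward 1.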
2. Inputs / Arguments
-arrayIntensityFile <data-file>: The path to an input file containing a
matrix of array intensity values. The file must be tab delimited with a
header line. Each line of the file contains data for one probe. The first four
columns must carry specific names: ID is an identifier for the probe, and the
other three columns give the 1-relative coordinates of the probe (the start
coordinate and END can be equal). Columns beyond the first four provide
intensity data for each sample. The header for each additional column should
contain the sample identifier.
-superArraySampleTag <tag-name>: This is the name of the INFO field that
contains the list of carrier samples, which must be comma-separated. The default tag name is
SASAMPLES. For example,
SASAMPLES=NA12878,NA12891,NA12892 would indicate that three samples carry the
variant.
-sample <sample-or-sample-list>: A subset of samples on which to perform the test (or a .list file of sample identifiers). The default behavior is to use all samples in the array intensity file.
-superArrayPermute <true/false>: If set to true, then the sample identities are permuted before performing the test to generate a null distribution.
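To make the input conventions concrete, here is a minimal Python sketch that reads an intensity matrix, extracts the carrier list from an INFO string, and permutes sample identities. It is not part of Genome STRiP; the function names are hypothetical, the first four columns are addressed by position rather than by name, and every intensity cell is assumed to be numeric.

```python
import csv
import io
import random


def read_intensity_matrix(handle):
    """Read a tab-delimited intensity matrix that starts with a header line.

    Returns (sample_ids, probes). Each probe is a tuple
    (probe_id, coordinates, values), where coordinates holds the three
    coordinate columns as strings and values maps sample id -> intensity.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) < 5:
        raise ValueError("expected four probe columns plus sample columns")
    sample_ids = header[4:]
    probes = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = dict(zip(sample_ids, map(float, row[4:])))
        probes.append((row[0], tuple(row[1:4]), values))
    return sample_ids, probes


def carriers_from_info(info, tag="SASAMPLES"):
    """Return the comma-separated carrier list stored under `tag`."""
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == tag and value:
            return value.split(",")
    return []


def permute_sample_ids(sample_ids, rng):
    """Map each sample id to a randomly chosen id (a permutation).

    Relabelling carriers through this mapping gives one draw from a
    null distribution, in the spirit of -superArrayPermute.
    """
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return dict(zip(sample_ids, shuffled))


carriers = carriers_from_info("SASAMPLES=NA12878,NA12891,NA12892")
```

Passing a seeded `random.Random` instance to `permute_sample_ids` keeps a permutation run reproducible.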
This annotator produces three INFO field annotations for each VCF record:
SANSAMPLES: The number of carrier samples.
SANPROBES: The number of probes underlying the event.
SAPVALUE: The calculated p-value.
The annotator can also generate a tab-delimited report file containing these
annotations.
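As an illustration of how the three annotations might be consumed downstream, the following Python sketch pulls them out of a VCF INFO string. The function name is hypothetical, and the sketch assumes that SANSAMPLES and SANPROBES are integers and that SAPVALUE is a plain number.

```python
def superarray_annotations(info):
    """Extract SANSAMPLES, SANPROBES and SAPVALUE from a VCF INFO string.

    Returns None if any of the three annotations is missing.
    """
    fields = dict(
        item.split("=", 1) for item in info.split(";") if "=" in item
    )
    if not {"SANSAMPLES", "SANPROBES", "SAPVALUE"} <= fields.keys():
        return None
    return {
        "SANSAMPLES": int(fields["SANSAMPLES"]),
        "SANPROBES": int(fields["SANPROBES"]),
        "SAPVALUE": float(fields["SAPVALUE"]),
    }


# Made-up INFO string for illustration only.
annotations = superarray_annotations("SANSAMPLES=3;SANPROBES=12;SAPVALUE=1.5e-4")
```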
The SuperArray annotator requires Genome STRiP and R.
export SV_DIR=/path/to/SVToolkit/root/directory
java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.gatk.CommandLineGATK \
    -T SVVariantAnnotator \
    -A SuperArray \
    -R /humgen/1kg/reference/human_g1k_v37.fasta \
    -BTI variant \
    -B:variant,VCF input.vcf \
    -O output.vcf \
    -arrayIntensityFile Omni25_superarray_intensity_matrix.dat \
    -sample discovery_samples.list \
    -superArraySampleTag SAMPLES \
    -writeReport \
    -reportFile superarray_output.dat
The SuperArray annotator uses R as well as Java and can consume up to 10G of
memory.
The SuperArray annotator uses an exact test in many cases, and this can be
expensive. If you want to test more than a few hundred variants, you should
consider splitting the input VCF file into pieces and processing them in
parallel. In sample runs, testing 1000 variants in 1000 samples can take 2 to
3 hours.
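One way to split the input, sketched here in Python, is to copy the header lines into every piece and distribute the records in fixed-size chunks. The function name and chunk size are hypothetical; the sketch assumes all header lines (those starting with "#") precede the records.

```python
def split_vcf(lines, records_per_chunk):
    """Yield lists of lines, each a complete VCF: header plus a slice of records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    header, chunk = [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        chunk.append(line)
        if len(chunk) == records_per_chunk:
            yield header + chunk
            chunk = []
    if chunk:
        yield header + chunk  # final, possibly shorter, piece


# Toy input (made up): two header lines followed by five records.
example = ["##fileformat=VCFv4.0\n", "#CHROM\tPOS\tID\n"] + [
    "1\t%d\tv%d\n" % (100 * i, i) for i in range(1, 6)
]
pieces = list(split_vcf(example, 2))
```

Each piece can then be written to its own file and annotated by a separate SVVariantAnnotator run.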