Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Documentation for IntensityRankSum annotator

bhandsakerbhandsaker Member, Broadie, Moderator admin

This post provides some documentation on using the IntensityRankSum annotator with Genome STRiP. This is part of the SVAnnotator framework.

Introduction

SVAnnotator is a general framework for generating annotations of different kinds on VCF files containing structural variant records.

SVAnnotator is conceptually similar to the GATK VariantAnnotator walker, but works somewhat differently and is tailored for use with structural variations.
Each annotator is like a plug-in and you can run one or more annotators over the same file in one invocation.

Each annotator can either add annotations (typically INFO fields) to the input VCF file, or it can generate a textual report (e.g. tab delimited)
for additional processing, or it can generate one or more summary reports (also text files) or all of the above. Different annotators may or may
not support all modes of operation.

Common Arguments

The following common arguments are handled by the SVAnnotator framework and are common to all annotators:

-A annotator

The name of the annotator(s) you want to run (the annotations you want to generate).

-vcf input-vcf-file

The input VCF file you want to annotate.
The input VCF file must be sorted in coordinate order based on the reference sequence.

-R reference-file

Indexed fasta file containing the reference sequence.

-writeReport true/false

Flag indicating that the annotator should print a report file (default false).

Typically, this will contain one line per VCF record and will contain the same information as the annotations that would be stored in the VCF file, but may be easier to parse.

-writeSummary true/false

Flag indicating that the annotator should write a summary file (default false).

Depending on the annotator, the summary file produces information organized in different ways.
For example, the VariantsPerSample annotator produces a summary file containing a row for each sample and the count of variants in that sample.

-reportDirectory directory

The directory to write report files and summary files (default is the current directory).

-tempDir directory

The directory to use store temporary files.

Outputs

-O output-file

The destination for the output VCF file.

If you do not specify -O, then the annotator will not produce VCF annotations (but may produce reports or summaries).

-reportFile file-path

The path and file name for the report file. This overrides -reportDirectory.

If not supplied, the report file is based on the name of the annotator (and will be in the report directory).

-summaryFile file-path

The path and file name for the summary file. This overrides -reportDirectory.

If not supplied, each annotator generates a default summary file name (if the annotator supports writing summary files).

IntensityRankSum Annotator

The IntensityRankSum annotator is invoked through the SVAnnotator framework, which defines arguments common to all annotators.

The IntensityRankSum annotator uses SNP array intensity data to do a form of "in silico" validation of copy number variants.
The general approach is to use a Wilcoxon rank sum test using the intensity data and to calculate a p-value as follows:
For each array probe underlying the variant, each sample is assigned an integral rank for that probe.
Then the set of ranks (across all probes) is combined and treated as a set of observations for the Wilcoxon rank sum test.
If there is more than one probe, there will certainly be ties (i.e. some sample will be rank 1 with respect to each probe).
Ties are broken randomly to assign the final ranking. The random seed is based on the input data, so identical inputs will
always produce identical results. The Wilcoxon rank sum test is applied to the final combined ranking to test whether the
event-carrying samples (see below) are shifted with respect to the non-event-carrying samples. This is a one-tailed test
of a negative shift (for deletions) or a positive shift (for duplications).
For CNVs, two tests are performed (and two p-values generated). A negative-shift test is made for samples with copy number
less than two (compared to the copy number two samples) and a positive-shift test is made for samples with copy number
greater than two (compared to the copy number two samples).

This annotator can be run in one of several modes, depending on whether you have genotypes for each sample or simply a list
of samples believed to carry the variant (either as homozygotes or heterozygotes). The -irsUseGenotypes flag indicates which
mode to use.

If you do not use genotypes (i.e. you are evaluating a set of discovered sites without genotypes), then the annotator will
expect an INFO tag on each variant to indicate the set of samples that are thought to carry the variant (either as homozygotes or heterozygotes).
The variants must be either deletions (SVTYPE=DEL) or duplications (SVTYPE=DUP).
In addition, you must supply a list of samples that were evaluated during discovery (using -sample) so that the list of non-carrier samples can be determined for each variant.

If you use genotypes, then the annotator will process any kind of CNV (SVTYPE=DEL, SVTYPE=DUP or SVTYPE=CNV).
Genotypes will be taken from the GT/GQ fields (if present) or from the CN/CNQ fields and the genotypes will be used to determine carrier and non-carrier samples.
Only DEL or DUP variants support the use of GT/GQ.

Annotator Arguments

-arrayIntensityFile file

The array intensity file must be supplied as a tab-delimited matrix of array intensity values.

The file consists of four fixed columns (ID, CHROM, START, END) and a variable number of additional columns, one per sample that has array data.
The file consists of a header line specifying these four columns and the name of each sample.
After the header line, the file contains one line per array-probe location and must be sorted in reference-sequence order.
Each array-probe location should have a single intensity value. For SNP probes, this is usually the sum of the A and B normalized probe intensities.
The START and END coordinates are generally the same, corresponding to the SNP position.

-sample sample(s) or .list file

For discovery variants, the list of samples used in discovery (which may be a subset of the samples in the array intensity file).

-irsUseGenotypes true/false (default false)

Whether to use genotypes or INFO tags to determine carrier samples.

-irsSampleTag infotag

When not using genotypes, the INFO tag on each variant that lists the carrier samples (default IRSSAMPLES).

-irsPermute true/false

If true, perform a random permutation on rankings before performing the Wilcoxon rank-sum test to generate a null distribution.

-filterGenotypes true/false (default true)

If false, do not use the GT:FT field in the input VCF file to filter genotypes.

-genotypeQualityThreshold threshold

Use this quality threshold to determine whether to filter genotypes based on GQ or CNQ.
This argument does not automatically disable genotype filters (GT:FT), so is often used in conjunction with -filterGenotypes false.

Annotations

This annotator produces the following INFO field annotations for each VCF record (or the corresponding columns in a report file):

IRS_NPROBES (NPROBES)

The number of array probes used for this variant.

IRS_NSAMPLES (NSAMPLES)

The number of affected samples.

IRS_LOWERNSAMPLES (LOWERNSAMPLES)

The number of carrier samples for the deletion p-value.

IRS_LOWERPVALUE (LOWERPVALUE)

The p-value based on testing deleted samples (carrier samples with copy number less than two).

IRS_UPPERNSAMPLES (UPPERNSAMPLES)

The number of carrier samples for the duplication p-value.

IRS_UPPERPVALUE (UPPERPVALUE)

The p-value based on testing duplicated samples (carrier samples with copy number greater than two).

IRS_PVALUE (PVALUE)

The lesser of IRS_LOWERPVALUE and IRS_UPPERPVALUE.
You should not use this minimum p-value for FDR estimation unless you compute and apply a suitable null expectation.
Instead, you should stratify your variants based on type (deletion, duplication, mixed) and estimate FDR separately using IRS_LOWERPVALUE and IRS_UPPERPVALUE.

Example

The IntensityRankSum annotator uses R as well as Genome STRiP.

export SV_DIR=/path/to/SVToolkit/root/directory

# Discovery example (using INFO tags)
java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.main.SVAnnotator \
    -A IntensityRankSum \
    -R human_g1k_v37.fasta \
    -vcf input.vcf \
    -O output.vcf \
    -arrayIntensityFile ALL.genome.Omni25_probe_intensity_matrix.20110425.dat \
    -sample discovery_samples.list \
    -irsSampleTag IRSSAMPLES \
    -writeReport \
    -reportFile irs_output.report.dat

# Genotyped example
java -Xmx4g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.main.SVAnnotator \
    -A IntensityRankSum \
    -R human_g1k_v37.fasta \
    -vcf input.vcf \
    -O output.vcf \
    -arrayIntensityFile ALL.genome.Omni25_probe_intensity_matrix.20110425.dat \
    -irsUseGenotypes true \
    -writeReport \
    -reportFile irs_output.report.dat

Since

Release 1.04.1162

Comments

  • dantakidantaki La JollaMember

    Why are overlapping calls neglected. In my report file I see something like this.

    chr1: 25000-50000

    chr1: 42000-90000

    I cannot merge the calls because they do not overlap too much but the Pvalue for the second call is always NA. For any overlapping calls the first element always has a Pvalue but the following ones return NA.

    What gives?

  • dantakidantaki La JollaMember

    @dantaki said:
    Why are overlapping calls neglected. In my report file I see something like this.

    chr1: 25000-50000

    chr1: 42000-90000

    I cannot merge the calls because they do not overlap too much but the Pvalue for the second call is always NA. For any overlapping calls the first element always has a Pvalue but the following ones return NA.

    What gives?

    Maybe the documentation was not as detailed in 2014 but the problem is that the VCF input file has to be sorted based on the order found in the reference file. Note this is different from sortBed or UNIX sort. Check the .fai file svtoolkit uses for help.

Sign In or Register to comment.