# GenomeSTRiP Main Page

edited June 2014

### 1. Introduction

Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovery and genotyping of structural variation using sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.

Genome STRiP requires genomes from multiple individuals in order to detect or genotype variants. Typically 20 to 30 genomes are required to get good results. It is possible to use publicly available reference data (e.g. sequence data from the 1000 Genomes Project) as a background population to call events in single genomes, but this strategy has not been widely tried nor thoroughly evaluated.

Genome STRiP uses the Genome Analysis Toolkit (GATK). Pre-defined Queue pipelines are provided to simplify running analyses.

The current release of Genome STRiP is focused on discovery and genotyping of deletions relative to a reference sequence. Extensions to support other types of structural variation are planned.

Genome STRiP is under active development and improvement. We are making current under-development versions available in the hopes that they may be of use to others.

To run the current versions successfully, you will need to read and understand how the method works and you may have to adapt the example scripts to your particular data set.

Before posting, please review the FAQ.

### 2. Structure

Genome STRiP consists of a number of modules, related as shown below.

To perform discovery and genotyping, you would run all four modules in order: SVPreprocess, SVDiscovery, SVAltAlign, SVGenotyping. To genotype a set of known variants using new samples, you can skip the SVDiscovery step.
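
As a rough illustration of the module ordering described above, the following Python sketch returns the modules to run for each of the two use cases. The function name `modules_to_run` and the `discover` flag are hypothetical helpers made up for this example; they are not part of Genome STRiP.

```python
from typing import List

# Module names as listed in the text above, in run order.
ALL_MODULES = ["SVPreprocess", "SVDiscovery", "SVAltAlign", "SVGenotyping"]


def modules_to_run(discover: bool = True) -> List[str]:
    """Return the modules to run, in order.

    discover=True  -> discovery and genotyping (all four modules).
    discover=False -> genotyping known variants in new samples,
                      which skips the SVDiscovery step.
    """
    if discover:
        return list(ALL_MODULES)
    return [m for m in ALL_MODULES if m != "SVDiscovery"]


if __name__ == "__main__":
    print(modules_to_run(discover=True))
    print(modules_to_run(discover=False))
```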

### 3. Inputs and Outputs

Genome STRiP requires aligned sequence data in BAM format.

The primary outputs from Genome STRiP are polymorphic sites of structural variation and/or genotypes for these sites, both of which are represented in VCF format.

Genome STRiP also requires a FASTA file containing the reference sequence used to align the input reads. The input FASTA file must be indexed using samtools faidx or the equivalent.
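
Before launching a run, it can help to confirm that the inputs are in place. The sketch below is an illustrative pre-flight check, not part of Genome STRiP: it assumes the reference index sits next to the FASTA file with a `.fai` suffix (the file that `samtools faidx` writes), and the function name `input_problems` is made up for this example.

```python
import os
from typing import List


def input_problems(reference_fasta: str, bam_files: List[str]) -> List[str]:
    """Return a list of human-readable problems with the inputs (empty if none)."""
    problems = []
    if not os.path.isfile(reference_fasta):
        problems.append("reference FASTA not found: " + reference_fasta)
    elif not os.path.isfile(reference_fasta + ".fai"):
        # `samtools faidx ref.fasta` writes ref.fasta.fai next to the FASTA file.
        problems.append("reference FASTA is not indexed: " + reference_fasta)
    for bam in bam_files:
        if not os.path.isfile(bam):
            problems.append("BAM file not found: " + bam)
    return problems
```

An empty list means the basic checks passed. This only tests that the files exist; it does not validate their contents.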

### 4. Installation

Current and previous binary releases are available from our website http://www.broadinstitute.org/software/genomestrip.

You will need to install pre-requisite software as described below.
There is a 10-minute installation/verification test in the installtest subdirectory.

The test scripts also serve as example pipelines for running Genome STRiP.

#### Environment Variables

Currently, Genome STRiP requires you to set the SV_DIR environment variable to the installation directory.
See the installtest scripts for details.
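
As a minimal sketch of this requirement, a wrapper script might check that `SV_DIR` is set and points to a directory before starting a run. The helper name `resolve_sv_dir` is hypothetical, and the check deliberately makes no assumption about the layout inside the installation directory.

```python
import os
from typing import Mapping, Optional


def resolve_sv_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Genome STRiP installation directory named by SV_DIR.

    Raises RuntimeError if SV_DIR is unset, empty, or not a directory.
    """
    env = os.environ if env is None else env
    sv_dir = env.get("SV_DIR", "")
    if not sv_dir:
        raise RuntimeError("SV_DIR is not set; point it at the installation directory")
    if not os.path.isdir(sv_dir):
        raise RuntimeError("SV_DIR does not point to a directory: " + sv_dir)
    return sv_dir
```

Passing the environment as a parameter keeps the check easy to test without modifying the real process environment.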

### 5. Dependencies

#### Java

Genome STRiP is written mostly in Java and packaged as a jar file (SVToolkit.jar).
You will need Java 1.7.

#### GATK

Genome STRiP is integrated with the Genome Analysis Toolkit (GATK) and requires
GenomeAnalysisTK.jar in order to run.
The pipelines that automate running Genome STRiP are written as Queue scripts and these pipelines require Queue.jar to run.

The SVToolkit distribution comes with a set of compatible pre-built jar files for GATK and Queue. We can't promise source or binary compatibility between different versions of GATK and SVToolkit. If you mix and match versions, you are on your own and you should scrutinize your results carefully.

#### Picard

The Genome STRiP pipelines use some Picard standalone command line utilities.
You will need to install these separately. URL: http://picard.sourceforge.net

#### Samtools

The pipelines use 'samtools index' to index BAM files.
You will need to install samtools separately.
URL: http://samtools.sourceforge.net

This dependency on samtools could in theory be replaced with Picard 'BuildBamIndex', if you can't run samtools for some reason.

#### BWA

Several pipeline functions use BWA (the executable) and also use BWA through its C API.
You will need to install BWA separately. URL: http://bio-bwa.sourceforge.net

A pre-built Linux shared library, libbwa.so, that is required by GenomeSTRiP comes with the SVToolkit distribution. This library is built from the BWA source code and source code that is part of GATK.

The current version of this library is built from BWA 0.5.8, but it should be compatible with most other versions of BWA. If you have problems, you can try running with the pre-built version of bwa included in the distribution that was built from the same version as the shared library.

#### R

Genome STRiP uses some R scripts internally.

To run Genome STRiP, R must be installed separately and the Rscript executable must be on your path.

Genome STRiP should run with R 2.8 and above and may run with older versions as well, but this has not been tested.
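
One way to confirm that the command-line dependencies above are reachable is to look each one up on the path. This is only an illustrative sketch: the executable names (`java`, `samtools`, `bwa`, `Rscript`) are the usual names for these tools and are assumed here, and the Picard utilities are not covered because they are installed separately and this check only looks for executables on the path. The function name `missing_executables` is made up for this example.

```python
import shutil
from typing import Callable, List, Optional, Sequence

# Assumed executable names for the dependencies listed above.
REQUIRED_EXECUTABLES = ("java", "samtools", "bwa", "Rscript")


def missing_executables(
    names: Sequence[str] = REQUIRED_EXECUTABLES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required executables were found on PATH.")
```

Finding an executable says nothing about its version, so the version notes above (Java 1.7, R 2.8 and above) still need to be checked by hand.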

### 6. Running Genome STRiP

Before attempting to run Genome STRiP on your own data, please run the short installation test in the installtest subdirectory.
This will ensure that your environment is set up properly. The test scripts also offer an example of how to organize your
run directory structure and some sample end-to-end pipelines.

A number of pre-defined Queue pipeline scripts are provided to run the different phases of analysis in Genome STRiP.
Queue is a flexible Scala-based system for writing processing pipelines that can be distributed on compute farms. These pipeline scripts should be taken as example templates and they may need to be modified for your specific analysis.

Each processing step has a corresponding Queue pipeline script:

#### SVPreprocess

Preprocess a set of input BAM files to generate genome-wide metadata used by other Genome STRiP modules.

#### SVAltAlign

Re-alignment of reads from input BAM files to alternative alleles described in an input VCF file.

#### SVDiscovery

Run deletion discovery on a set of input BAM files, producing a VCF file of potentially variant sites.

#### SVGenotyper

Genotype a set of polymorphic structural variation loci described in a VCF file.

#### Genome STRiP Functions | Components

The Queue pipelines invoke a series of processing steps, most of which are implemented as GATK Walkers or as Java utility programs. New pipelines can be constructed from these more elemental components. See Genome STRiP Functions for more information.

### 7. Support

We have set up a mailing list for bug reports and questions at svtoolkit-help@lists.sourceforge.net.

You can also consult the support page at http://sourceforge.net/projects/svtoolkit/support.

The FAQ is here.

Note that we are currently not distributing software through sourceforge. Software must be downloaded from our website http://www.broadinstitute.org/software/genomestrip.

Geraldine Van der Auwera, PhD


• Posts: 1

Hi. I'm just starting to learn bioinformatics and I must say I find Genome STRiP very complex. I have two questions. First, can you run it on non-human data? I have BAM files for Plasmodium falciparum. Second, does it work like vcftools for calling small indels, and if so, how would it compare? I have about 70 falciparum samples and I am looking for a tool to detect indels across all the samples. Would you recommend any method?

Re: your second question, if what you want to do is call indels, it's GATK itself that you want. See the Guide section of this website for documentation.

Geraldine Van der Auwera, PhD

• Posts: 386 ✭✭✭

I might clarify here to say "small indels".
Small is subjective, of course, but I think in practice tools like GATK do better for indel sizes up to around the length of your reads, whereas structural variation tools like Genome STRiP are designed for larger events and more power at several hundred bp and up, depending on sequencing depth.

Regarding plasmodium, am I correct that the samples you have sequenced are haploid?

Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

• Posts: 72

Can the GenomeSTRiP software use ReduceReads bam files as input?

• Posts: 386 ✭✭✭

No, Genome STRiP doesn't work on GATK reduced bams.

Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

• Posts: 14

Is GenomeSTRiP now able to identify other types of variants, such as insertions, inversions, translocations, etc.? We know it performed well at deletion detection and genotyping in the 1000 Genomes Phase 1 analysis, but we haven't seen an evaluation of its performance on SVs other than deletions. In addition, I would like to ask for your suggestions on using GenomeSTRiP to capture non-deletion SVs.

• Posts: 386 ✭✭✭

We have pipelines under development for calling other forms of variation, mostly copy number variation.
These new pipelines aren't available for public use yet.

Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

• China,DenmarkPosts: 10

Hi Bob,
I found that the newest 1000 Genomes Phase 3 VCF files include a CNV call set from GenomeSTRiP. But I can't find any pipelines for it, and I can only detect deletions with GenomeSTRiP. Would you mind telling me how GenomeSTRiP called CNVs (copy number variants) in 1000 Genomes Phase 3?

• Posts: 386 ✭✭✭

We should have a new major release ("Genome STRiP 2.0") coming out by the end of the month. This will include the new CNV calling pipeline that we applied in 1000 Genomes Phase 3.

There will also be a manuscript in Nature Genetics describing the methods.

Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

• University of Sussex, UKPosts: 118 ✭✭

Hi,

I was wondering what all the metrics stand for in the VCF files output by the Genome STRiP SV and CNV pipelines. Apologies if this is in the documentation and I've missed it. I know some are fairly self-explanatory, but others, such as cohFN, memb-statistic and depthNcalls, are a bit more abstract. It would probably be useful for other people to know what the terms mean.

For SV they are:
CIEND
CIPOS
END
SVLEN
GSELENGTH
GSCOHERENCE
GSCOHFN
GSCOHPVALUE
GSMEMBSTATISTIC
GSDEPTHNTOTALSAMPLES
GSDEPTHCALLTHRESHOLD
GSDEPTHPVALUE
GSDEPTHRANKSUMPVALUE
GSNDEPTHCALLS
GSDEPTHRATIO

For CNV pipeline they are:
GSCNCATEGORY
GCFRACTION
GCLENGTH
GLALTSUM
GLHETSUM
GSCLUSTERSEP
GSCNQUAL