GenomeSTRiP Main Page

Posted by Geraldine_VdAuwera (Administrator, GATK Developer)

1. Introduction

Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovery and genotyping of structural variation using sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.

Genome STRiP requires genomes from multiple individuals in order to detect or genotype variants. Typically 20 to 30 genomes are required to get good results. It is possible to use publicly available reference data (e.g. sequence data from the 1000 Genomes Project) as a background population to call events in single genomes, but this strategy has not been widely tried nor thoroughly evaluated.

Genome STRiP uses the Genome Analysis Toolkit (GATK). Pre-defined Queue pipelines are provided to simplify running analyses.

The current release of Genome STRiP is focused on discovery and genotyping of deletions relative to a reference sequence. Extensions to support other types of structural variation are planned.

Genome STRiP is under active development and improvement. We are making current under-development versions available in the hopes that they may be of use to others.

To run the current versions successfully, you will need to read and understand how the method works and you may have to adapt the example scripts to your particular data set. Please report bugs through svtoolkit-help@lists.sourceforge.net.

Before posting, please review the FAQ.

2. Structure

Genome STRiP consists of a number of modules, related as shown below.

[Figure: GenomeStripSchematic.png, schematic of the Genome STRiP modules]

To perform discovery and genotyping, you would run all four modules in order: SVPreprocess, SVDiscovery, SVAltAlign, SVGenotyper. To genotype a set of known variants using new samples, you can skip the SVDiscovery step.
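The two run modes described above can be sketched as a small helper. This is only an illustration of the ordering: the function name `modules_to_run` is hypothetical and not part of Genome STRiP, and the module names follow the pipeline script list in section 6.

```python
from typing import List

# Module names as listed in section 6, in the order they are run.
FULL_PIPELINE = ["SVPreprocess", "SVDiscovery", "SVAltAlign", "SVGenotyper"]


def modules_to_run(discover_new_sites: bool = True) -> List[str]:
    """Return the modules to run, in order (hypothetical helper).

    When genotyping a set of known variants in new samples,
    the SVDiscovery step is skipped.
    """
    if discover_new_sites:
        return list(FULL_PIPELINE)
    return [name for name in FULL_PIPELINE if name != "SVDiscovery"]
```

Calling `modules_to_run(discover_new_sites=False)` gives the shorter sequence used when the variant sites are already known.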

3. Inputs and Outputs

Genome STRiP requires aligned sequence data in BAM format.

The primary outputs from Genome STRiP are polymorphic sites of structural variation and/or genotypes for these sites, both of which are represented in VCF format.

Genome STRiP also requires a FASTA file containing the reference sequence used to align the input reads. The input FASTA file must be indexed using samtools faidx or the equivalent.
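Before starting a run, it can help to confirm that the reference FASTA has its index next to it. The sketch below assumes the usual samtools faidx convention of writing the index to the FASTA file name with ".fai" appended; the helper names are hypothetical and not part of Genome STRiP.

```python
from pathlib import Path


def fasta_index_path(fasta: str) -> Path:
    """Return the path where samtools faidx writes the index (FASTA name + '.fai')."""
    return Path(fasta + ".fai")


def is_indexed(fasta: str) -> bool:
    """True if both the FASTA file and its .fai index exist (hypothetical check)."""
    return Path(fasta).is_file() and fasta_index_path(fasta).is_file()
```

If `is_indexed("reference.fasta")` returns False for an existing FASTA file, running `samtools faidx reference.fasta` should create the missing index.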

4. Downloading and Installation

Current and previous binary releases are available from our website http://www.broadinstitute.org/software/genomestrip.

To install, download the tarball and decompress into a suitable directory. You will need to install pre-requisite software as described below. There is a 10-minute installation/verification test in the installtest subdirectory. You will also need to download (or build) a suitable Genome Mask File.

The test scripts also serve as example pipelines for running Genome STRiP.

Environment Variables

Currently, Genome STRiP requires you to set the SV_DIR environment variable to the installation directory. See the installtest scripts for details.
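A wrapper script might verify SV_DIR before launching anything. The following is a minimal sketch, not taken from the installtest scripts; `resolve_sv_dir` is a hypothetical helper, and the only thing it checks is that the variable is set and names an existing directory.

```python
import os
from typing import Mapping, Optional


def resolve_sv_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Genome STRiP installation directory from SV_DIR (hypothetical helper).

    Raises RuntimeError if SV_DIR is unset, empty, or not a directory.
    """
    env = os.environ if env is None else env
    sv_dir = env.get("SV_DIR")
    if not sv_dir:
        raise RuntimeError("SV_DIR is not set; point it at the Genome STRiP installation directory")
    if not os.path.isdir(sv_dir):
        raise RuntimeError("SV_DIR does not name an existing directory: " + sv_dir)
    return sv_dir
```

Passing an explicit mapping instead of relying on `os.environ` makes the check easy to exercise in isolation.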

5. Dependencies

Java

Genome STRiP is written mostly in Java and packaged as a jar file (SVToolkit.jar). You will need Java 1.7.

GATK

Genome STRiP is integrated with the Genome Analysis Toolkit (GATK) and requires GenomeAnalysisTK.jar in order to run. The pipelines that automate running Genome STRiP are written as Queue scripts and these pipelines require Queue.jar to run.

The SVToolkit distribution comes with a set of compatible pre-built jar files for GATK and Queue. We can't promise source or binary compatibility between different versions of GATK and SVToolkit. If you mix and match versions, you are on your own and you should scrutinize your results carefully.

Picard

The Genome STRiP pipelines use some Picard standalone command line utilities. You will need to install these separately. URL: http://picard.sourceforge.net

Samtools

The pipelines use 'samtools index' to index BAM files. You will need to install samtools separately. URL: http://samtools.sourceforge.net

This dependency on samtools could in theory be replaced with Picard 'BuildBAMIndex', if you can't run samtools for some reason.

BWA

Several pipeline functions use BWA (the executable) and also use BWA through its C API. You will need to install BWA separately. URL: http://bio-bwa.sourceforge.net

A pre-built Linux shared library, libbwa.so, that is required by GenomeSTRiP comes with the SVToolkit distribution. This library is built from the BWA source code and source code that is part of GATK.

The current version of this library is built from BWA 0.5.8, but it should be compatible with most other versions of BWA. If you have problems, try running with the pre-built bwa executable included in the distribution, which was built from the same BWA version as the shared library.

R

Genome STRiP uses some R scripts internally.

To run Genome STRiP, R must be installed separately and the Rscript executable must be on your path.

Genome STRiP should run with R 2.8 and above and may run with older versions as well, but this has not been tested.
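Several of the dependencies in this section are command-line executables that must be found on your path. A quick way to check this is sketched below, assuming the executables are named java, samtools, bwa and Rscript on your system; the helper names are hypothetical, and Picard is left out because it is distributed as jar files rather than a single executable. The sketch only checks presence, not versions.

```python
import shutil
from typing import Dict, List, Optional, Sequence

# Assumed executable names for the dependencies described in this section.
REQUIRED_EXECUTABLES = ("java", "samtools", "bwa", "Rscript")


def find_executables(
    names: Sequence[str] = REQUIRED_EXECUTABLES, path: Optional[str] = None
) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved location, or None if not found."""
    return {name: shutil.which(name, path=path) for name in names}


def missing_executables(
    names: Sequence[str] = REQUIRED_EXECUTABLES, path: Optional[str] = None
) -> List[str]:
    """Return the names that could not be found on the search path."""
    found = find_executables(names, path)
    return [name for name in names if found[name] is None]
```

An empty list from `missing_executables()` suggests the tools are at least discoverable; the installtest scripts remain the real check that the environment works.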

6. Running Genome STRiP

Before attempting to run Genome STRiP on your own data, please run the short installation test in the installtest subdirectory. This will ensure that your environment is set up properly. The test scripts also offer an example of how to organize your run directory structure and some sample end-to-end pipelines.

A number of pre-defined Queue pipeline scripts are provided to run the different phases of analysis in Genome STRiP. Queue is a flexible Scala-based system for writing processing pipelines that can be distributed on compute farms. Treat these pipeline scripts as example templates; you may need to modify them for your specific analysis.

Each processing step has a corresponding Queue pipeline script:

SVPreprocess

Preprocess a set of input BAM files to generate genome-wide metadata used by other Genome STRiP modules.

SVAltAlign

Re-alignment of reads from input BAM files to alternative alleles described in an input VCF file.

SVDiscovery

Run deletion discovery on a set of input BAM files, producing a VCF file of potentially variant sites.

SVGenotyper

Genotype a set of polymorphic structural variation loci described in a VCF file.

Genome STRiP Functions | Components

The Queue pipelines invoke a series of processing steps, most of which are implemented as GATK Walkers or as Java utility programs. New pipelines can be constructed from these more elemental components. See Genome STRiP Functions for more information.

7. Support

We have set up a mailing list for bug reports and questions at svtoolkit-help@lists.sourceforge.net.

You can also consult the support page at http://sourceforge.net/projects/svtoolkit/support.

The FAQ is here.

Note that we are currently not distributing software through sourceforge. Software must be downloaded from our website http://www.broadinstitute.org/software/genomestrip.

Post edited by bhandsaker on

Geraldine Van der Auwera, PhD
