GenomeSTRiP Main Page

Geraldine_VdAuwera (Administrator, GSA Member)

1. Introduction

Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovery and genotyping of structural variation using sequencing data. The methods used in Genome STRiP are designed to find shared variation using data from multiple individuals. Genome STRiP looks both across and within a set of sequenced genomes to detect variation.

Genome STRiP requires genomes from multiple individuals in order to detect or genotype variants. Typically 20 to 30 genomes are required to get good results. It is possible to use publicly available reference data (e.g. sequence data from the 1000 Genomes Project) as a background population to call events in single genomes, but this strategy has not been widely tried nor thoroughly evaluated.

Genome STRiP uses the Genome Analysis Toolkit (GATK). Pre-defined Queue pipelines are provided to simplify running analyses.

The current release of Genome STRiP is focused on discovery and genotyping of deletions relative to a reference sequence. Extensions to support other types of structural variation are planned.

Genome STRiP is under active development and improvement. We are making current under-development versions available in the hopes that they may be of use to others.

To run the current versions successfully, you will need to read and understand how the method works and you may have to adapt the example scripts to your particular data set. Please report bugs through svtoolkit-help@lists.sourceforge.net.

Before posting, please review the FAQ.

2. Structure

Genome STRiP consists of a number of modules, related as shown below.

[Figure: GenomeStripSchematic.png, schematic of the Genome STRiP modules]

To perform discovery and genotyping, you would run all four modules in order: SVPreprocess, SVDiscovery, SVAltAlign, SVGenotyper. To genotype a set of known variants using new samples, you can skip the SVDiscovery step.
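
The module ordering described above can be sketched in Python. This is only an illustration: the helper name modules_to_run and its discover flag are hypothetical and not part of Genome STRiP. The module names follow the pipeline scripts listed in section 6.

```python
# Illustrative only: modules_to_run is a hypothetical helper, not part of Genome STRiP.
ALL_MODULES = ["SVPreprocess", "SVDiscovery", "SVAltAlign", "SVGenotyper"]


def modules_to_run(discover=True):
    """Return the module names to run, in order.

    discover=True  -> discovery plus genotyping (all four modules)
    discover=False -> genotype known variants in new samples (skip SVDiscovery)
    """
    if discover:
        return list(ALL_MODULES)
    return [m for m in ALL_MODULES if m != "SVDiscovery"]
```

Calling modules_to_run(discover=False) returns the three modules needed to genotype known variants in new samples.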

3. Inputs and Outputs

Genome STRiP requires aligned sequence data in BAM format.

The primary outputs from Genome STRiP are polymorphic sites of structural variation and/or genotypes for these sites, both of which are represented in VCF format.

Genome STRiP also requires a FASTA file containing the reference sequence used to align the input reads. The input FASTA file must be indexed using samtools faidx or the equivalent.
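
As a minimal sketch of the indexing requirement, the following Python helper checks for the .fai file that samtools faidx writes next to the FASTA and runs the command only when the index is missing. The function names are hypothetical, and the sketch assumes samtools is on your path. It is not part of the Genome STRiP distribution.

```python
import os
import subprocess


def fasta_index_path(fasta_path):
    """samtools faidx writes the index next to the FASTA as <fasta>.fai."""
    return fasta_path + ".fai"


def ensure_fasta_index(fasta_path, run=subprocess.check_call):
    """Create the .fai index with 'samtools faidx' if it is missing.

    Returns True if indexing was run, False if an index already existed.
    The 'run' parameter exists only so the command can be stubbed out.
    """
    if not os.path.isfile(fasta_path):
        raise FileNotFoundError(fasta_path)
    if os.path.isfile(fasta_index_path(fasta_path)):
        return False
    run(["samtools", "faidx", fasta_path])
    return True
```

The check only looks for the presence of the index file. It does not verify that the index is up to date with the FASTA.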

4. Downloading and Installation

Current and previous binary releases are available from our website http://www.broadinstitute.org/software/genomestrip.

To install, download the tarball and decompress it into a suitable directory. You will need to install the prerequisite software described below. There is a 10-minute installation/verification test in the installtest subdirectory. You will also need to download (or build) a suitable Genome Mask File.

The test scripts also serve as example pipelines for running Genome STRiP.

Environment Variables

Currently, Genome STRiP requires you to set the SV_DIR environment variable to the installation directory. See the installtest scripts for details.
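
If you drive the pipelines from your own wrapper script, a check along the following lines can catch a missing SV_DIR early. This is a sketch under assumptions: resolve_sv_dir is a hypothetical helper, and it only verifies that SV_DIR is set and names a directory, not that the directory holds a valid installation.

```python
import os


def resolve_sv_dir(environ=os.environ):
    """Return the Genome STRiP installation directory named by SV_DIR.

    Raises RuntimeError if SV_DIR is unset or does not point at a directory.
    """
    sv_dir = environ.get("SV_DIR")
    if not sv_dir:
        raise RuntimeError(
            "SV_DIR is not set; point it at the Genome STRiP installation directory"
        )
    if not os.path.isdir(sv_dir):
        raise RuntimeError("SV_DIR is not a directory: " + sv_dir)
    return sv_dir
```

The installtest scripts remain the authoritative reference for how SV_DIR is used.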

5. Dependencies

Java

Genome STRiP is written mostly in Java and packaged as a jar file (SVToolkit.jar). You will need Java 1.6.

GATK

Genome STRiP is integrated with the Genome Analysis Toolkit (GATK) and requires GenomeAnalysisTK.jar in order to run. The pipelines that automate running Genome STRiP are written as Queue scripts and these pipelines require Queue.jar to run.

The SVToolkit distribution comes with a set of compatible pre-built jar files for GATK and Queue. We can't promise source or binary compatibility between different versions of GATK and SVToolkit. If you mix and match versions, you are on your own and you should scrutinize your results carefully.

Picard

The Genome STRiP pipelines use some Picard standalone command line utilities. You will need to install these separately. URL: http://picard.sourceforge.net

Samtools

The pipelines use 'samtools index' to index BAM files. You will need to install samtools separately. URL: http://samtools.sourceforge.net

This dependency on samtools could in theory be replaced with Picard 'BuildBamIndex', if you can't run samtools for some reason.

BWA

Several pipeline functions use BWA (the executable) and also use BWA through its C API. You will need to install BWA separately. URL: http://bio-bwa.sourceforge.net

A pre-built Linux shared library, libbwa.so, that is required by GenomeSTRiP comes with the SVToolkit distribution. This library is built from the BWA source code and source code that is part of GATK.

The current version of this library is built from BWA 0.5.8, but it should be compatible with most other versions of BWA. If you have problems, you can try running with the pre-built version of bwa included in the distribution that was built from the same version as the shared library.

R

Genome STRiP uses some R scripts internally.

To run Genome STRiP, R must be installed separately and the Rscript executable must be on your path.

Genome STRiP should run with R 2.8 and above and may run with older versions as well, but this has not been tested.
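
Before running the installation test, it can help to confirm that the external executables mentioned in this section are visible on your path. The sketch below is illustrative: missing_tools is a hypothetical helper, and the tool list is an assumption drawn from the text above. The Picard utilities and the bundled GATK and Queue jars are not checked, and neither are version numbers.

```python
import shutil

# Executables named in this section. "java" is included because the
# toolkit is packaged as a jar file; the list itself is an assumption.
REQUIRED_TOOLS = ["java", "samtools", "bwa", "Rscript"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of 'tools' that cannot be found on PATH."""
    return [t for t in tools if which(t) is None]
```

An empty list from missing_tools() means every listed executable was found. It does not guarantee that the installed versions are compatible.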

6. Running Genome STRiP

Before attempting to run Genome STRiP on your own data, please run the short installation test in the installtest subdirectory. This will ensure that your environment is set up properly. The test scripts also offer an example of how to organize your run directory structure and some sample end-to-end pipelines.

A number of pre-defined Queue pipeline scripts are provided to run the different phases of analysis in Genome STRiP. Queue is a flexible Scala-based system for writing processing pipelines that can be distributed on compute farms. These pipeline scripts should be taken as example templates and they may need to be modified for your specific analysis.

Each processing step has a corresponding Queue pipeline script:

SVPreprocess

Preprocess a set of input BAM files to generate genome-wide metadata used by other Genome STRiP modules.

SVAltAlign

Re-alignment of reads from input BAM files to alternative alleles described in an input VCF file.

SVDiscovery

Run deletion discovery on a set of input BAM files, producing a VCF file of potentially variant sites.

SVGenotyper

Genotype a set of polymorphic structural variation loci described in a VCF file.

Genome STRiP Functions | Components

The Queue pipelines invoke a series of processing steps, most of which are implemented as GATK Walkers or as Java utility programs. New pipelines can be constructed from these more elemental components. See Genome STRiP Functions for more information.

7. Support

We have set up a mailing list for bug reports and questions at svtoolkit-help@lists.sourceforge.net.

You can also consult the support page at http://sourceforge.net/projects/svtoolkit/support.

The FAQ is here.

Note that we are currently not distributing software through SourceForge. Software must be downloaded from our website http://www.broadinstitute.org/software/genomestrip.

Geraldine Van der Auwera, PhD

Comments

  • elisold (Member)

    Hi. I'm just starting to learn bioinformatics and I must say I find Genome STRiP very complex. I have two questions. First, can you run it on non-human data? I have BAM files for Plasmodium falciparum. Second, does it work like vcftools for calling small indels, and if so, how would it compare? I have about 70 falciparum samples and I am looking for a tool to detect indels across all of them. Which method would you recommend?

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Re: your second question, if what you want to do is call indels, it's GATK itself that you want. See the Guide section of this website for documentation.


  • bhandsaker (Member, Third-party Developer)

    I might clarify here to say "small indels". Small is subjective, of course, but I think in practice tools like GATK do better for indel sizes up to around the length of your reads, whereas structural variation tools like Genome STRiP are designed for larger events and more power at several hundred bp and up, depending on sequencing depth.

    Regarding plasmodium, am I correct that the samples you have sequenced are haploid?

    Bob Handsaker, Broad Institute / Harvard Medical School Dept of Genetics

  • jfarrell (Member)

    Can the Genome STRiP software use ReduceReads BAM files as input?

  • bhandsaker (Member, Third-party Developer)

    No, Genome STRiP doesn't work on GATK reduced BAMs.


  • SiyangLiu (Member)

    Is Genome STRiP now effective at identifying other types of variants, such as insertions, inversions, translocations, etc.? We know it performed well for deletion detection and genotyping in the 1000 Genomes Phase I analysis, but we haven't seen an evaluation of its performance on SVs other than deletions. In addition, I would like to ask for your suggestions on using Genome STRiP to capture non-deletion SVs.

  • bhandsaker (Member, Third-party Developer)

    We have pipelines under development for calling other forms of variation, mostly copy number variation. These new pipelines aren't available for public use yet.

