SVDiscovery Queue script

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,736Administrator, GATK Dev admin
edited September 2012 in GenomeSTRiP Documentation

1. Introduction

SVDiscovery.q is a sample Queue script that is part of Genome STRiP.

This script runs deletion discovery over an input data set based on a set of
bam files. The input bam files must have been previously run through the
SVPreprocess pipeline to generate auxilliary metadata.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files.

  • -md <directory> : The metadata directory in which to store computed
    metadata about the input data set.

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing
    the reference sequence that the input BAM files were aligned against. The
    fasta file must be indexed with 'samtools faidx' or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of
    the reference sequence. : See Genome Mask

  • -genderMapFile <gender-map-file> : A file that contains the expected
    gender for each sample. : Tab delimited file with sample ID and gender on each
    line. Gender can be specified as M/F or 1 (male) and 2 (female).

  • -configFile <configuration-file> : This file contains values for
    specialized settings that do not normally need to be changed. : A default
    configuration file is provided in conf/genstrip_parameters.txt.

  • -runDirectory <directory> : Directory in which to place output files and
    intermediate run files.

  • -minimumSize <n> : The minimum size of an event that should be detected.

  • -maximumSize <n> : The maximum size of an event that should be detected. :
    This parameter also determines the size of the overlap between search windows
    when partitioning for parallel processing.

  • -windowSize <n> : For parallel processing, the size of each genomic locus
    to process in parallel. : The maximum size of an event that can be detected is
    also limited by the window size.

  • -windowPadding : This parameter specifies how far outside the search locus
    should be searched for informative read pairs. : This should be set based on
    the maximum insert sizes for read pairs in the input data set. : Ideally, we
    should estimate this parameter from the input data set.

3. Outputs

  • -O <vcf-file> : The main output is a VCF file containing candidate SV
    sites. : The output VCF file also contains a variety of metrics in the INFO
    field that should be used for filtering to select a final set of SV calls.

The SVDiscovery pipeline also produces a number of other intermediate output
files, useful mostly for debugging. The content of these files is not
documented and is subject to change. If the genome is processed in parallel,
there will be output from each parallel partition plus merged genome-wide

4. Running

The SVDiscovery.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line
must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \ 
    -S SVDiscovery.q \ 
    -S SVQScript.q \ 
    -gatk GenomeAnalysisTK.jar \ 
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \ 
    -configFile conf/genstrip_parameters.txt \ 
    -tempDir /path/to/tmp/dir \ 
    -runDirectory run1 \ 
    -md metadata \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \ 
    -genderMapFile \ 
    -I input1.bam -I input2.bam \ 
    -O output.sites.vcf \ 
    -minimumSize 100 \
    -maximumSize 1000000 \ 
    -windowSize 10000000 \ 
    -windowPadding 10000 \ 
    -run \
    -bsub \ 
    -jobQueue lsf_queue_name \ 
    -jobProject lsf_project \ 
    -jobLogDir logs

5. Parallel Processing

The discovery pipeline is designed to allow parallelism across many
processors. Parallelism is achieved by partitioning the problem space and
running on overlapping genomic windows, then merging the output from each

One practical strategy, which was employed in the pilot phase of the 1000
Genomes Project and is as shown in the example above, is to search for
deletions between 100 bases and 1Mb in 10Mb windows (overlapping by 1Mb) using
10Kb padding, based on having paired-end libraries with relatively small
insert sizes (100 bases - 2Kb).

It is also possible to partition the problem space by size of event. For
example, you could search for small events (< 1Kb), medium sized events (1Kb -
100Kb) and large events (100Kb - 10Mb) in separate parallel runs, each with
suitable window sizes, and then merge the outputs together. This approach is
not currently implemented in an automated QScript, but if you are adventurous
you could do this fairly easily using the same underlying building blocks as
the standard pipeline.

6. Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. : The base script SVQScript.q from the
    SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The java classpath to use for pipeline commands. This
    must include SVToolkit.jar and GenomeAnalysisTK.jar. : Note: Both -cp
    arguments are required in the example command. The first -cp argument is for
    the invocation of Queue itself, the second -cp argument is for the invocation
    of pipeline processes that will be run by Queue.

  • -tempDir <directory> : Path to a directory to use for temporary files.

7. Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Sign In or Register to comment.