SVPreprocess Queue script

Geraldine_VdAuwera, Administrator, GATK Dev
edited September 2012 in GenomeSTRiP Documentation

1. Introduction

SVPreprocess.q is a sample Queue script that is part of Genome STRiP.

This script preprocesses a set of input BAM files to generate genome-wide
metadata that will be used in subsequent phases of Genome STRiP.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files. These files form a "data
    set" that will be analyzed by Genome STRiP. The BAM files must have
    appropriate headers, including read group (RG) and sample (SM) tags.

  • -md <directory> : The metadata directory in which to store computed
    metadata about the input data set.

  • -R <fasta-file> : Reference sequence. The fasta file containing the
    reference sequence that the input BAM files were aligned against. The
    fasta file must be indexed with samtools faidx or the equivalent.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of
    the reference sequence. See Genome Mask Files.

  • -genderMapFile <gender-map-file> : A tab-delimited file that gives the
    expected gender for each sample, with one sample ID and gender per line.
    Gender can be specified as M or F, or as 1 (male) or 2 (female).

  • -configFile <configuration-file> : This file contains values for
    specialized settings that do not normally need to be changed. A default
    configuration file is provided in conf/genstrip_parameters.txt.
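
The gender map format described above is simple enough to check before a run. The sketch below is a hypothetical helper, not part of Genome STRiP. It assumes exactly two tab-separated fields per line and accepts only the gender codes listed above (M, F, 1, 2), so any other encoding the tool might tolerate would be reported here as unexpected.

```shell
# Hypothetical helper (not part of Genome STRiP): sanity-check a gender map.
# Assumes each line is "sample_id<TAB>gender" with gender one of M, F, 1, 2.
check_gender_map() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 !~ /^[MF12]$/ {
            # Report the offending line and remember that a problem was seen
            printf "line %d: unexpected record: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Called as check_gender_map sample_genders.map, it prints each line that does not match the assumed format and returns a non-zero status if any were found.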

3. Outputs

The SVPreprocess pipeline produces a number of output files in the specified
metadata directory. The data files produced and the file formats used are
subject to change in future releases.

Currently, output files are produced in the following categories:

1. Insert size distributions (isd)

Binary files are generated that contain information on the distribution of
insert lengths for each library or read group in the input BAM files.
Normally, all of these are merged into one file called isd.hist.bin.

A text file, isd.stats.dat, is also produced that contains informative
statistics about each library or read group, including the median insert
length and the robust standard deviation (RSD). This file can be reviewed to
identify libraries with unusual insert size distributions that should be
withheld from analysis.
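
One way to approach that review is to compare each library's median insert length against the others. The sketch below is only an illustration: it does not parse isd.stats.dat (whose layout is not described here and may change), and instead assumes you have already extracted "library<TAB>median" pairs into a hypothetical two-column input. The 25% default threshold is arbitrary.

```shell
# Hypothetical helper (not part of Genome STRiP).
# Input on stdin: "library<TAB>median_insert_length" lines. This is an
# assumed two-column format, NOT the layout of isd.stats.dat.
# Prints libraries whose median differs from the median of all library
# medians by more than the given fraction (default 0.25, an arbitrary choice).
flag_unusual_libraries() {
    frac=${1:-0.25}
    sort -t "$(printf '\t')" -k2,2n | awk -F '\t' -v frac="$frac" '
        { lib[NR] = $1; med[NR] = $2 }
        END {
            if (NR == 0) exit
            # Input is sorted by median, so the centre value is the middle row(s)
            if (NR % 2) {
                center = med[(NR + 1) / 2]
            } else {
                center = (med[NR / 2] + med[NR / 2 + 1]) / 2
            }
            for (i = 1; i <= NR; i++) {
                d = med[i] - center
                if (d < 0) d = -d
                if (d > frac * center) print lib[i]
            }
        }'
}
```

Flagged libraries are candidates for a closer look, not an automatic exclusion list.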

2. Read depth (depth)

The main output is a text file, depth.dat, containing the genome-wide
count of aligned fragments. The counts depend on the filtering
parameters used (principally mapping quality).

3. Read span coverage (spans)

The main output is a text file, spans.dat, containing the genome-wide
coverage (in base pairs) that exists between the two alignments that comprise
a read pair. The span coverage is an approximation of the power to detect a
breakpoint by spanning read pairs.
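
As a small worked example of the idea, consider a single read pair. The sketch below is a hypothetical illustration, not Genome STRiP code. It assumes 1-based inclusive coordinates and that the span runs from the base after the left alignment to the base before the right alignment; the exact boundaries Genome STRiP uses are not stated here.

```shell
# Hypothetical illustration (not Genome STRiP code): number of reference bases
# strictly between the two alignments of a read pair, using 1-based inclusive
# coordinates. Overlapping or abutting alignments give 0.
span_between() {
    left_end=$1      # last aligned base of the leftmost read
    right_start=$2   # first aligned base of the rightmost read
    span=$((right_start - left_end - 1))
    if [ "$span" -lt 0 ]; then
        span=0
    fi
    echo "$span"
}
```

For a pair whose left read ends at position 1099 and whose right read starts at 1300, span_between 1099 1300 prints 200. Under this reading, each of those 200 positions would gain one unit of span coverage from the pair.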

4. Running

The SVPreprocess.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line
must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \
    -S SVPreprocess.q \
    -S SVQScript.q \
    -gatk GenomeAnalysisTK.jar \
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt \
    -tempDir /path/to/tmp/dir \
    -md metadata \
    -R Homo_sapiens_assembly18.fasta \
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \
    -genderMapFile sample_genders.map \
    -I input1.bam -I input2.bam \
    -run \
    -bsub \
    -jobQueue lsf_queue_name \
    -jobProject lsf_project \
    -jobLogDir logs
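
Before submitting a command like the one above, it can save time to confirm two of the input requirements from section 2: that the reference has a samtools faidx index, and that the BAM headers carry read group and sample tags. The functions below are hypothetical helpers, not part of Genome STRiP or Queue. The index check assumes the index sits next to the reference as <fasta>.fai, which is where samtools faidx writes it.

```shell
# Hypothetical pre-flight checks (not part of Genome STRiP or Queue).

# Succeeds if the reference and its samtools faidx index (<fasta>.fai) exist.
check_reference_index() {
    if [ ! -f "$1" ]; then
        echo "reference not found: $1" >&2
        return 1
    fi
    if [ ! -f "$1.fai" ]; then
        echo "index not found: $1.fai (samtools faidx $1 creates it)" >&2
        return 1
    fi
}

# Reads a SAM/BAM header on stdin; succeeds only if there is at least one
# @RG line and every @RG line carries a non-empty SM (sample) tag.
check_read_groups() {
    awk -F '\t' '
        $1 == "@RG" {
            rg++
            has_sm = 0
            for (i = 2; i <= NF; i++) if ($i ~ /^SM:./) has_sm = 1
            if (!has_sm) { print "read group without SM tag: " $0; bad = 1 }
        }
        END {
            if (rg == 0) { print "no @RG lines found"; bad = 1 }
            exit bad + 0
        }
    '
}
```

For example, check_reference_index Homo_sapiens_assembly18.fasta tests the reference, and samtools view -H input1.bam | check_read_groups tests one BAM header.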

5. Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP
pipelines.

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. The base script SVQScript.q from the
    SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The Java classpath to use for pipeline commands. This
    must include SVToolkit.jar and GenomeAnalysisTK.jar. Note: Both -cp
    arguments are required in the example command. The first -cp argument
    sets the classpath for the invocation of Queue itself; the second sets
    the classpath for the pipeline processes that Queue runs.

  • -tempDir <directory> : Path to a directory to use for temporary files.
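
Because the -cp argument appears twice with overlapping contents, one option is to derive both from a single variable so they cannot drift apart. The function below is a hypothetical wrapper, not part of Queue or Genome STRiP. It only prints the command line, using the jar names from the example in section 4, so the result can be inspected before anything is run.

```shell
# Hypothetical wrapper (not part of Queue or Genome STRiP): prints, but does
# not run, a Queue command line with both classpaths built from one variable.
# Jar names follow the example command and are assumed to be in the
# current directory.
print_queue_command() {
    pipeline_cp="SVToolkit.jar:GenomeAnalysisTK.jar"
    # First -cp: a java option, the classpath for Queue itself.
    # Second -cp: a Queue argument, the classpath for the jobs Queue launches.
    echo "java -Xmx2g -cp Queue.jar:${pipeline_cp}" \
        "org.broadinstitute.sting.queue.QCommandLine" \
        "-S SVPreprocess.q -S SVQScript.q" \
        "-gatk GenomeAnalysisTK.jar" \
        "-cp ${pipeline_cp}" \
        "$@"
}
```

Any remaining arguments are appended unchanged, for example print_queue_command -md metadata -R Homo_sapiens_assembly18.fasta.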

6. Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.


Geraldine Van der Auwera, PhD
