SVPreprocess Queue script

Posted by Geraldine_VdAuwera (Administrator, GATK Developer); edited September 2012 in GenomeSTRiP Documentation

1. Introduction

SVPreprocess.q is a sample Queue script that is part of Genome STRiP.

This script preprocesses a set of input BAM files to generate genome-wide metadata that will be used in subsequent phases of Genome STRiP.

2. Inputs / Arguments

  • -I <bam-file> : The set of input BAM files. These files form a "data set" that will be analyzed by Genome STRiP. The BAM files must have appropriate headers, including read group (RG) and sample (SM) tags.

  • -md <directory> : The metadata directory in which to store computed metadata about the input data set.

  • -R <fasta-file> : Reference sequence. An indexed FASTA file containing the reference sequence that the input BAM files were aligned against. The FASTA file must be indexed with samtools faidx or an equivalent tool.

  • -genomeMaskFile <mask-file> : Mask file that describes the alignability of the reference sequence. See Genome Mask Files.

  • -genderMapFile <gender-map-file> : A file that contains the expected gender for each sample. This is a tab-delimited file with a sample ID and a gender on each line. Gender can be specified as M or F, or as 1 (male) or 2 (female).

  • -configFile <configuration-file> : This file contains values for specialized settings that do not normally need to be changed. A default configuration file is provided in conf/genstrip_parameters.txt.
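
Some of the input requirements above can be checked before launching the pipeline. The following is a minimal shell sketch, not part of Genome STRiP: the function names check_reference_index and check_gender_map are hypothetical. It assumes that samtools faidx wrote its index next to the reference as <fasta-file>.fai, and that the gender map has exactly two tab-separated columns with an uppercase M/F or 1/2 in the second one.

```shell
# Hypothetical pre-flight checks; these helpers are illustrative only.

# Succeeds if a non-empty samtools faidx index (<fasta>.fai) sits next to the reference.
check_reference_index() {
    [ -s "$1.fai" ]
}

# Succeeds if every line has two tab-separated fields (sample ID, gender)
# and the gender is one of M, F, 1 or 2. Offending lines are reported on stderr.
check_gender_map() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 !~ /^[MF12]$/ {
            printf "line %d: unexpected entry: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage (file names taken from the example command in this document):
#   check_reference_index Homo_sapiens_assembly18.fasta && \
#       check_gender_map sample_genders.map
```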

3. Outputs

The SVPreprocess pipeline produces a number of output files in the specified metadata directory. The data files produced and the file formats used are subject to change in future releases.

Currently, output files are produced in the following categories:

1. Insert size distributions (isd)

Binary files are generated that contain information on the distribution of insert lengths for each library or read group in the input BAM files. Normally, all of these are merged into one file called isd.hist.bin.

A text file, isd.stats.dat, is also produced that contains informative statistics about each library or read group, including the median insert length and the robust standard deviation (RSD). This file can be reviewed to identify libraries with unusual insert size distributions that should be withheld from analysis.

2. Read depth (depth)

The main output is a text file, depth.dat, containing the genome-wide count of aligned fragments. The read counts are based on the filtering parameters used (principally mapping quality).

3. Read span coverage (spans)

The main output is a text file, spans.dat, containing the genome-wide coverage (in base pairs) of the region between the two alignments that make up a read pair. The span coverage is an approximation of the power to detect a breakpoint by spanning read pairs.

4. Running

The SVPreprocess.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \
    -S SVPreprocess.q \
    -S SVQScript.q \
    -gatk GenomeAnalysisTK.jar \
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt \
    -tempDir /path/to/tmp/dir \
    -md metadata \
    -R Homo_sapiens_assembly18.fasta \
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \
    -genderMapFile sample_genders.map \
    -I input1.bam -I input2.bam \
    -run \
    -bsub \
    -jobQueue lsf_queue_name \
    -jobProject lsf_project \
    -jobLogDir logs
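
When a data set contains many BAM files, typing one -I argument per file becomes error-prone. As a sketch, the hypothetical helper below (with_bam_inputs is not part of Queue or Genome STRiP) appends one -I <bam-file> pair for every .bam file in a directory to whatever command it is given, then runs that command. It assumes that Queue does not depend on where the -I arguments appear on the command line.

```shell
# Hypothetical helper: run a command with "-I <bam>" appended for each BAM in a directory.
#   $1     directory containing the input BAM files
#   $2...  the command to run (for example the java invocation of Queue)
# The body runs in a subshell so that its variables do not leak into the caller.
with_bam_inputs() (
    bam_dir="$1"
    shift
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue      # skip the literal pattern when no BAM files match
        set -- "$@" -I "$bam"          # append one -I pair to the command line
    done
    "$@"
)

# Usage (the directory is a placeholder): prefix the command with echo
# to inspect the generated arguments without running anything.
#   with_bam_inputs /path/to/bams echo java -Xmx2g -md metadata
```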

5. Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP pipelines.

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. The base script SVQScript.q from the SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The Java classpath to use for pipeline commands. This must include SVToolkit.jar and GenomeAnalysisTK.jar. Note: Both -cp arguments in the example command are required. The first applies to the invocation of Queue itself; the second applies to the pipeline processes that Queue runs.

  • -tempDir <directory> : Path to a directory to use for temporary files.
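
The two -cp arguments and the dry-run default are easy to get wrong. The sketch below is a hypothetical helper (print_queue_command is not part of Queue) that only prints a partial command line: it reuses one classpath value for both -cp arguments and adds -run only when asked. The data set arguments (-md, -R, -I and so on) are left out for brevity, and the jar names and paths are the placeholders from the example command.

```shell
# Hypothetical helper: print (do not execute) the Queue-level part of the command line.
#   $1  "run" to include -run; anything else leaves Queue in its default dry-run mode
# The body runs in a subshell so that its variables do not leak into the caller.
print_queue_command() (
    mode="${1:-dry}"
    pipeline_cp="SVToolkit.jar:GenomeAnalysisTK.jar"

    # First -cp: classpath for the JVM that runs Queue itself.
    # Second -cp: classpath Queue passes to the pipeline processes it launches.
    set -- java -Xmx2g -cp "Queue.jar:${pipeline_cp}" \
        org.broadinstitute.sting.queue.QCommandLine \
        -S SVPreprocess.q \
        -S SVQScript.q \
        -gatk GenomeAnalysisTK.jar \
        -cp "${pipeline_cp}" \
        -tempDir /path/to/tmp/dir

    if [ "$mode" = "run" ]; then
        set -- "$@" -run
    fi
    echo "$@"
)

# Usage:
#   print_queue_command        # dry run: no -run argument
#   print_queue_command run    # same command with -run appended
```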

6. Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.

Geraldine Van der Auwera, PhD