SVPreprocess Queue script
SVPreprocess.q is a sample Queue script that is part of Genome STRiP.
This script preprocesses a set of input BAM files to generate genome-wide
metadata that will be used in subsequent phases of Genome STRiP.
2. Inputs / Arguments
-I <bam-file>: The set of input BAM files. These files form a "data
set" that will be analyzed by Genome STRiP. The BAM files must have
appropriate headers including read group (RG) and sample (SM) tags.
-md <directory>: The metadata directory in which to store computed
metadata about the input data set.
-R <fasta-file>: Reference sequence. An indexed fasta file containing
the reference sequence that the input BAM files were aligned against. The
fasta file must be indexed with samtools faidx or the equivalent.
-genomeMaskFile <mask-file>: Mask file that describes the alignability of
the reference sequence. See Genome Mask Files.
-genderMapFile <gender-map-file>: A file that contains the expected
gender for each sample. Tab delimited file with sample ID and gender on each
line. Gender can be specified as M/F or 1 (male) and 2 (female).
-configFile <configuration-file>: This file contains values for
specialized settings that do not normally need to be changed. A default
configuration file is provided in
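
Two of the input requirements above (read group and sample tags in the BAM headers, and a well-formed gender map) can be checked before launching the pipeline. The following is a minimal sketch, not part of Genome STRiP: the function names check_gender_map and check_read_groups are hypothetical helpers, the sample IDs are illustrative, and the header check assumes the header text is piped in, for example from samtools view -H.

```shell
# Illustrative gender map: sample ID, a tab, then M/F (or 1/2).
# The sample IDs are placeholders; use the SM values from your own BAM headers.
printf 'NA12878\tF\nNA12891\tM\n' > sample_genders.map

# Hypothetical helper: succeed only if every line has exactly two
# tab-separated fields and the gender field is M, F, 1 or 2.
check_gender_map() {
    awk -F '\t' '
        NF != 2 || $2 !~ /^[MF12]$/ {
            printf "line %d: invalid entry\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Hypothetical helper: read SAM header text on stdin and succeed only if
# there is at least one @RG line and every @RG line carries an SM tag.
check_read_groups() {
    awk -F '\t' '
        $1 == "@RG" {
            n++
            has_sm = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SM:./) has_sm = 1
            }
            if (!has_sm) {
                printf "header line %d: @RG without SM tag\n", NR
                bad = 1
            }
        }
        END {
            if (n == 0) {
                print "no @RG lines found"
                bad = 1
            }
            exit bad + 0
        }
    '
}

check_gender_map sample_genders.map && echo "gender map OK"

# With real data the header would come from: samtools view -H input1.bam
printf '@HD\tVN:1.0\n@RG\tID:rg1\tSM:NA12878\n' | check_read_groups \
    && echo "read groups OK"
```

Both helpers return a non-zero exit status on failure, so they can gate the Queue invocation in a wrapper script.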
The SVPreprocess pipeline produces a number of output files in the specified
metadata directory. The data files produced and the file formats used are
subject to change in future releases.
Currently, output files are produced in the following categories:
1. Insert size distributions (isd)
Binary files are generated that contain information on the distribution of
insert lengths for each library or read group in the input BAM files.
Normally, all of these are merged into one file called
A text file,
isd.stats.dat, is also produced that contains informative
statistics about each library or read group, including the median insert
length and the robust standard deviation (RSD). This file can be reviewed to
identify libraries with unusual insert size distributions that should be
withheld from analysis.
2. Read depth (depth)
The main output is a text file,
depth.dat, containing the genome-wide
count of aligned fragments. The read counts are based on the filtering
parameters used (principally mapping quality).
3. Read span coverage (spans)
The main output is a text file,
spans.dat, containing the genome-wide
coverage (in base pairs) that exists between the two alignments that comprise
a read pair. The span coverage is an approximation of the power to detect a
breakpoint by spanning read pairs.
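
To make the idea of span coverage concrete, here is a toy sketch under a simplifying assumption: each read pair is reduced to two coordinates, the last base of the left alignment and the first base of the right alignment, and the span is taken to be the bases strictly between them. This is one reading of the description above, not the algorithm Genome STRiP uses, and the function name span_coverage and its input format are hypothetical.

```shell
# Hypothetical helper: read "left_end right_start" pairs (one per line) and
# report the total span in base pairs and how many positions are spanned by
# at least one read pair.
span_coverage() {
    awk '
        $2 > $1 + 1 {
            for (p = $1 + 1; p < $2; p++) {
                if (!(p in cov)) covered++
                cov[p]++
                total++
            }
        }
        END {
            printf "total_span_bp=%d covered_positions=%d\n", total, covered
        }
    '
}

# Two overlapping pairs: the first spans 101-104, the second spans 103-107.
printf '100 105\n102 108\n' | span_coverage
# prints: total_span_bp=9 covered_positions=7
```

Positions 103 and 104 are spanned by both pairs, which is why the total (9) exceeds the number of covered positions (7).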
The SVPreprocess.q script is run through Queue.
Because Genome STRiP is a third-party GATK library, the Queue command line
must be invoked explicitly, as shown in the example below.
java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \
    -S SVPreprocess.q \
    -S SVQScript.q \
    -gatk GenomeAnalysisTK.jar \
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt \
    -tempDir /path/to/tmp/dir \
    -md metadata \
    -R Homo_sapiens_assembly18.fasta \
    -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta \
    -genderMapFile sample_genders.map \
    -I input1.bam -I input2.bam \
    -run \
    -bsub \
    -jobQueue lsf_queue_name \
    -jobProject lsf_project \
    -jobLogDir logs
5. Typical Queue Arguments
Queue typically requires the following arguments to run Genome STRiP:
-run: Actually run the pipeline (default is to do a dry run).
-S <queue-script>: Script to run. The base script
SVToolkit should also be specified with a separate
-gatk <jar-file>: The path to the GATK jar file.
-cp <classpath>: The java classpath to use for pipeline commands. This
GenomeAnalysisTK.jar. Note: Both -cp
arguments are required in the example command. The first -cp argument is for
the invocation of Queue itself; the second -cp argument is for the invocation
of pipeline processes that will be run by Queue.
-tempDir <directory>: Path to a directory to use for temporary files.
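
Because Queue only performs a dry run unless -run is given, it can be convenient to assemble the command once and inspect it before committing to a real run. The sketch below builds the command string from the example command, using the same placeholder paths and file names. The function build_preprocess_cmd and its "dry" mode are hypothetical conveniences, and the LSF arguments are left out for brevity.

```shell
# Hypothetical helper: print the SVPreprocess command line.
# $1 is the mode: "dry" omits -run (Queue then does a dry run);
# any other value appends -run. Remaining arguments are input BAM files.
build_preprocess_cmd() {
    mode=$1
    shift
    # First -cp: classpath for the invocation of Queue itself.
    cmd="java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar"
    cmd="$cmd org.broadinstitute.sting.queue.QCommandLine"
    cmd="$cmd -S SVPreprocess.q -S SVQScript.q"
    cmd="$cmd -gatk GenomeAnalysisTK.jar"
    # Second -cp: classpath for the pipeline processes run by Queue.
    cmd="$cmd -cp SVToolkit.jar:GenomeAnalysisTK.jar"
    cmd="$cmd -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt"
    cmd="$cmd -tempDir /path/to/tmp/dir -md metadata"
    cmd="$cmd -R Homo_sapiens_assembly18.fasta"
    cmd="$cmd -genomeMaskFile Homo_sapiens_assembly18.mask.36.fasta"
    cmd="$cmd -genderMapFile sample_genders.map"
    for bam in "$@"; do
        cmd="$cmd -I $bam"
    done
    if [ "$mode" != "dry" ]; then
        cmd="$cmd -run"
    fi
    printf '%s\n' "$cmd"
}

# Print the dry-run form of the command for review.
build_preprocess_cmd dry input1.bam input2.bam
```

Calling the function with any mode other than "dry" prints the same command with -run appended.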
6. Queue LSF Arguments
-bsub: Use LSF to submit jobs.
-jobQueue <queue-name>: LSF queue to use.
-jobProject <project-name>: LSF project to use for accounting.
-jobLogDir <directory>: Directory for LSF log files.