SVAltAlign Queue script

1. Introduction

SVAltAlign.q is a sample Queue script that is part of Genome STRiP.

This script realigns previously unmapped reads against putative alternate
alleles generated from a VCF file describing a set of variants to be
genotyped. The output is a merged BAM file that contains these alignments to
the alternate alleles. These alternate allele alignments are then used as
input to genotyping.

2. Inputs / Arguments

  • -vcf <input-vcf-file> : A VCF file containing descriptions of the
    structural variations. Only records for structural variations with precise
    breakpoints will be processed.

  • -I <bam-file> : The set of input BAM files containing records to realign.

  • -md <directory> : The metadata directory containing metadata about the
    input data set.

  • -R <fasta-file> : Reference sequence. An indexed fasta file containing
    the reference sequence. The fasta file must be indexed with samtools faidx
    or the equivalent.

  • -altAlleleFlankLength <n> : The length of flanking sequence from the
    reference genome used during realignment (default 200).

  • -alignUnmappedMates <boolean> : Whether to align unmapped mates of mapped
    reads to the alternate alleles (default true). If false, then unmapped reads
    with a POS field will be ignored.

  • -configFile <configuration-file> : This file contains values for
    specialized settings that do not normally need to be changed. A default
    configuration file is provided in conf/genstrip_parameters.txt.
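
Before a long pipeline run it can help to confirm that the inputs listed
above exist. The following is a minimal sketch in POSIX shell, not part of
Genome STRiP: the function name check_svaltalign_inputs and its argument
order are hypothetical. It relies on the fact that samtools faidx writes its
index next to the fasta file as <fasta-file>.fai.

```shell
# Hypothetical pre-flight helper; not part of Genome STRiP.
# Usage: check_svaltalign_inputs <vcf> <fasta> <metadata-dir> <bam>...
# Prints one line per missing input and returns non-zero if any is missing.
check_svaltalign_inputs() {
    vcf=$1; ref=$2; md=$3; shift 3
    rc=0
    [ -f "$vcf" ] || { echo "missing VCF: $vcf"; rc=1; }
    [ -f "$ref" ] || { echo "missing reference: $ref"; rc=1; }
    # samtools faidx stores the index alongside the fasta as <fasta>.fai
    [ -f "$ref.fai" ] || { echo "missing fasta index: $ref.fai"; rc=1; }
    [ -d "$md" ] || { echo "missing metadata directory: $md"; rc=1; }
    # Remaining arguments are the BAM files that would be passed with -I
    for bam in "$@"; do
        [ -f "$bam" ] || { echo "missing BAM: $bam"; rc=1; }
    done
    return $rc
}
```

If the index is reported missing, running samtools faidx on the reference
fasta file creates it.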

3. Outputs

  • -O <bam-file> : The default output for this pipeline is a single merged
    BAM file for all input BAM files and all alternate alleles. The sequence
    identifier for an alternate allele is VariantID_N where N is the index of the
    alternate allele in the VCF file (i.e. the first alternate allele is allele
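
When post-processing the merged BAM file, the VariantID_N sequence names may
need to be split back into the variant ID and the allele index. The sketch
below uses POSIX shell parameter expansion; the function names and the
example name DEL_P0001_2 are hypothetical. It splits on the last underscore,
on the assumption that a variant ID may itself contain underscores.

```shell
# Split an alternate-allele sequence name of the form VariantID_N.
# The split is on the LAST underscore, assuming the variant ID itself
# may contain underscores (an assumption, not stated in this document).
alt_allele_variant_id() { printf '%s\n' "${1%_*}"; }   # drop trailing _N
alt_allele_index()      { printf '%s\n' "${1##*_}"; }  # keep only N

alt_allele_variant_id "DEL_P0001_2"   # prints DEL_P0001
alt_allele_index "DEL_P0001_2"        # prints 2
```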

4. Running

The SVAltAlign.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line
must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \
    -S SVAltAlign.q \
    -S SVQScript.q \
    -gatk GenomeAnalysisTK.jar \
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt \
    -tempDir /path/to/tmp/dir \
    -md metadata \
    -R Homo_sapiens_assembly18.fasta \
    -vcf input.vcf \
    -I input1.bam -I input2.bam \
    -O output.bam \
    -run \
    -bsub \
    -jobQueue gsa \
    -jobProject 1KG \
    -jobLogDir logs
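
With many input BAM files, the repeated -I arguments are tedious to type by
hand. The sketch below is a hypothetical POSIX shell helper, not part of
Genome STRiP, that prints a command line like the one above with one -I per
BAM file. It only prints the command and leaves out -run and the LSF
arguments, so the printed command would perform a dry run until -run is
added. Jar names and paths are copied from the example above; file names
containing spaces are not handled.

```shell
# Hypothetical helper: print a Queue command line for SVAltAlign.q.
# Usage: print_svaltalign_command <vcf> <output-bam> <input-bam>...
# Assumes file names contain no whitespace.
print_svaltalign_command() {
    vcf=$1; out=$2; shift 2
    # Build one "-I <bam>" pair per remaining argument
    inputs=""
    for bam in "$@"; do
        inputs="$inputs -I $bam"
    done
    printf '%s' "java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar"
    printf ' %s' "org.broadinstitute.sting.queue.QCommandLine"
    printf ' %s' "-S SVAltAlign.q -S SVQScript.q"
    printf ' %s' "-gatk GenomeAnalysisTK.jar"
    printf ' %s' "-cp SVToolkit.jar:GenomeAnalysisTK.jar"
    printf ' %s' "-configFile /path/to/svtoolkit/conf/genstrip_parameters.txt"
    printf ' %s' "-tempDir /path/to/tmp/dir -md metadata"
    printf ' %s' "-R Homo_sapiens_assembly18.fasta"
    # -run is deliberately omitted, so Queue would only do a dry run
    printf ' %s\n' "-vcf $vcf$inputs -O $out"
}

print_svaltalign_command input.vcf output.bam input1.bam input2.bam
```

The printed line can be reviewed, then run with -run (and, on LSF, the
arguments from section 6) appended.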

5. Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP:

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. The base script SVQScript.q from the
    SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The java classpath to use for pipeline commands. This
    must include SVToolkit.jar and GenomeAnalysisTK.jar. Note: Both -cp
    arguments are required in the example command. The first -cp argument is for
    the invocation of Queue itself; the second -cp argument is for the invocation
    of pipeline processes that will be run by Queue.

  • -tempDir <directory> : Path to a directory to use for temporary files.

6. Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.

