SVAltAlign Queue script

Geraldine_VdAuwera (Administrator, GATK Developer)
edited September 2012 in GenomeSTRiP Documentation

1. Introduction

SVAltAlign.q is a sample Queue script that is part of Genome STRiP.

This script realigns previously unmapped reads against putative alternate alleles generated from a VCF file describing the set of variants to be genotyped. The output is a merged BAM file that contains the alignments to the alternate alleles. These alternate allele alignments are then used as input to genotyping.

2. Inputs / Arguments

  • -vcf <input-vcf-file> : A VCF file containing descriptions of the structural variations. Only records for structural variations with precise breakpoints will be processed.

  • -I <bam-file> : The set of input BAM files containing records to realign.

  • -md <directory> : The metadata directory containing metadata about the input data set.

  • -R <fasta-file> : Reference sequence. An indexed fasta file containing the reference sequence. The fasta file must be indexed with samtools faidx or the equivalent.

  • -altAlleleFlankLength <n> : The length of flanking sequence from the reference genome used during realignment (default 200).

  • -alignUnmappedMates <boolean> : Whether to align unmapped mates of mapped reads to the alternate alleles (default true). If false, then unmapped reads with a POS field will be ignored.

  • -configFile <configuration-file> : This file contains values for specialized settings that do not normally need to be changed. A default configuration file is provided in conf/genstrip_parameters.txt.
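As a rough pre-flight sketch for the inputs above: the helper names needs_faidx and bam_args are hypothetical and not part of Genome STRiP. The first checks for the .fai file that samtools faidx writes next to the FASTA file; the second expands a list of BAM files into repeated -I arguments, as used in the example command. It assumes file names without spaces.

```shell
# Succeeds (exit status 0) when the .fai index for the given FASTA file is missing.
needs_faidx() {
    [ ! -f "$1.fai" ]
}

# Print "-I <bam>" for every BAM file given, separated by single spaces.
# Assumes file names contain no whitespace.
bam_args() {
    args=""
    for bam in "$@"; do
        args="$args -I $bam"
    done
    printf '%s\n' "${args# }"
}

ref=Homo_sapiens_assembly18.fasta
if needs_faidx "$ref"; then
    echo "No index found for $ref; create one with: samtools faidx $ref"
fi

bam_args input1.bam input2.bam
```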

3. Outputs

  • -O <bam-file> : The default output for this pipeline is a single merged bam file for all input bam files and all alternate alleles. The sequence identifier for an alternate allele is VariantID_N, where N is the index of the alternate allele in the VCF file (i.e. the first alternate allele is allele 1).
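One way to see which sequence identifiers ended up in the output is to list the names in the BAM header. The sketch below assumes the alternate alleles appear as @SQ entries in the header of the merged BAM file, which this page does not state explicitly; sq_names is a hypothetical helper, and DEL_EXAMPLE is a made-up variant ID used only for the demonstration.

```shell
# Read SAM header text on stdin and print the SN (sequence name) of each @SQ line.
sq_names() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }'
}

# With a real file, the header would come from samtools:
#   samtools view -H output.bam | sq_names

# Demonstration on a hand-written header with one hypothetical alternate allele.
printf '@HD\tVN:1.0\n@SQ\tSN:DEL_EXAMPLE_1\tLN:400\n' | sq_names
```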

4. Running

The SVAltAlign.q script is run through Queue.

Because Genome STRiP is a third-party GATK library, the Queue command line must be invoked explicitly, as shown in the example below.

java -Xmx2g -cp Queue.jar:SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sting.queue.QCommandLine \
    -S SVAltAlign.q \
    -S SVQScript.q \
    -gatk GenomeAnalysisTK.jar \
    -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt \
    -tempDir /path/to/tmp/dir \
    -md metadata \
    -R Homo_sapiens_assembly18.fasta \
    -vcf input.vcf \
    -I input1.bam -I input2.bam \
    -O output.bam \
    -run \
    -bsub \
    -jobQueue gsa \
    -jobProject 1KG \
    -jobLogDir logs

5. Typical Queue Arguments

Queue typically requires the following arguments to run Genome STRiP pipelines.

  • -run : Actually run the pipeline (default is to do a dry run).

  • -S <queue-script> : Script to run. The base script SVQScript.q from the SVToolkit should also be specified with a separate -S argument.

  • -gatk <jar-file> : The path to the GATK jar file.

  • -cp <classpath> : The Java classpath to use for pipeline commands. This must include SVToolkit.jar and GenomeAnalysisTK.jar. Note: Both -cp arguments are required in the example command. The first -cp argument is for the invocation of Queue itself; the second -cp argument is for the invocation of pipeline processes that will be run by Queue.

  • -tempDir <directory> : Path to a directory to use for temporary files.
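Since the default is a dry run, it can be convenient to assemble the command once and add -run only when the dry run looks right. The following is a minimal sketch: queue_cmd and SV_CP are hypothetical names, the file paths are the placeholders from the example command, and the LSF arguments are left out. The function only prints the command line; it does not start Java. The shared SV_CP variable keeps the two -cp arguments consistent.

```shell
# Classpath entries needed by both the Queue invocation and the pipeline processes.
SV_CP="SVToolkit.jar:GenomeAnalysisTK.jar"

# Print the Queue command line. Pass "run" to append -run; otherwise the
# printed command performs a dry run.
queue_cmd() {
    mode=${1:-dry}
    cmd="java -Xmx2g -cp Queue.jar:${SV_CP} org.broadinstitute.sting.queue.QCommandLine"
    cmd="$cmd -S SVAltAlign.q -S SVQScript.q -gatk GenomeAnalysisTK.jar -cp ${SV_CP}"
    cmd="$cmd -configFile /path/to/svtoolkit/conf/genstrip_parameters.txt"
    cmd="$cmd -tempDir /path/to/tmp/dir -md metadata -R Homo_sapiens_assembly18.fasta"
    cmd="$cmd -vcf input.vcf -I input1.bam -I input2.bam -O output.bam"
    if [ "$mode" = "run" ]; then
        cmd="$cmd -run"
    fi
    printf '%s\n' "$cmd"
}

queue_cmd        # dry-run command line (no -run)
queue_cmd run    # command line that actually runs the pipeline
```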

6. Queue LSF Arguments

  • -bsub : Use LSF to submit jobs.

  • -jobQueue <queue-name> : LSF queue to use.

  • -jobProject <project-name> : LSF project to use for accounting.

  • -jobLogDir <directory> : Directory for LSF log files.

Geraldine Van der Auwera, PhD
