Overview of Queue

Geraldine_VdAuwera (Administrator, GATK Developer)
edited October 16 in Pipelining with Queue

1. Introduction

GATK-Queue is a command-line scripting framework for defining multi-stage genomic analysis pipelines, combined with an execution manager that runs those pipelines from end to end. Processing genome data often involves several steps to produce outputs. For example, our BAM to VCF calling pipeline includes, among other things:

  • Local realignment around indels
  • Emitting raw SNP calls
  • Emitting indels
  • Masking the SNPs at indels
  • Annotating SNPs using chip data
  • Labeling suspicious calls based on filters
  • Creating a summary report with statistics

Running these tools one by one in series can take weeks, and speeding them up by hand requires custom scripting to make use of parallel resources.

With a Queue script users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.
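To make the idea of running copies of the same program on different portions of the genome concrete, here is a minimal hand-rolled sketch of that scatter/gather pattern in plain shell. It is not how Queue is implemented; process_region is a hypothetical stand-in for a real analysis command, and the region names are placeholders.

```shell
# Sketch only: hand-rolled scatter/gather over genome regions.
# process_region is a hypothetical placeholder for a real tool invocation.
process_region() {
  region="$1"; out="$2"
  echo "calls for $region" > "$out"
}

workdir=$(mktemp -d)
regions="chr1 chr2 chr3"

# Scatter: start one background job per region.
for r in $regions; do
  process_region "$r" "$workdir/$r.out" &
done

# Block until every background job has finished.
wait

# Gather: concatenate the per-region outputs in a fixed order.
for r in $regions; do
  cat "$workdir/$r.out"
done > "$workdir/merged.out"
```

Note that this sketch does no error handling or retrying, and tracks no dependencies between steps. That bookkeeping is the part a Queue script hands off to the execution manager.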


2. Obtaining Queue

You have two options: download the binary distribution (prepackaged, ready to run program) or build it from source.

- Download the binary

This is obviously the easiest way to go. Links are on the Downloads page. Just get the Queue package; no need to get the GATK package separately as GATK is bundled in with Queue.

- Building Queue from source

Briefly, here's what you need to know/do:

Queue is part of the GATK repository. Download the source from the public repository on GitHub by running the following command:

git clone https://github.com/broadgsa/gatk.git

IMPORTANT NOTE: These instructions refer to the MIT-licensed version of the GATK+Queue source code. With that version, you will be able to build Queue itself, as well as the public portion of the GATK (the core framework), but that will not include the GATK analysis tools. If you want to use Queue to pipeline the GATK analysis tools, you need to clone the 'protected' repository. Please note however that part of the source code in that repository (the 'protected' module) is under a different license which excludes for-profit use, modification and redistribution.

Move to the git root directory and use Maven to build the source.

mvn clean verify

All dependencies will be managed by Maven as needed.

See this article on how to test your installation of Queue.


3. Running Queue

See this article on running Queue for the first time for full details.

Queue arguments can be listed by running with --help

java -jar dist/Queue.jar --help

To list the arguments required by a QScript, add the script with -S and run with --help.

java -jar dist/Queue.jar -S script.scala --help

Note that by default Queue runs in a "dry" mode, as explained in the link above: it generates the commands without executing them. After verifying the generated commands, execute the pipeline by adding -run.
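The dry-run-then-run workflow can be sketched as a small shell helper that prints the command for each mode. The function name queue_command and its "run" keyword are illustrative only; the jar path and script name are the placeholders used above.

```shell
# Sketch: print the Queue command line for a dry run or a real run.
# queue_command is a hypothetical helper, not part of Queue.
queue_command() {
  jar="$1"; script="$2"; mode="${3:-dry}"
  cmd="java -jar $jar -S $script"
  # Only a real run gets the -run flag; without it Queue stays in dry mode.
  if [ "$mode" = "run" ]; then
    cmd="$cmd -run"
  fi
  printf '%s\n' "$cmd"
}

queue_command dist/Queue.jar script.scala        # dry run (default)
queue_command dist/Queue.jar script.scala run    # real execution
```

Printing the command rather than executing it keeps the helper safe to try on a machine where Queue is not yet installed.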

See QFunction and Command Line Options for more info on adjusting Queue options.

4. QScripts

General Information

Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.

Every QScript includes the following steps:

  • New instances of CommandLineFunctions are created
  • Input and output arguments are specified on each function
  • The function is added with add() to Queue for dispatch and monitoring

The basic command line to run a Queue pipeline is

java -jar Queue.jar -S <script>.scala
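As a rough illustration of the steps above, the following shell snippet writes a minimal QScript to disk, modeled on the HelloWorld pattern. Treat it as a sketch: the import path is an assumption that depends on the Queue release (older Sting-era builds use org.broadinstitute.sting.queue.QScript), and this script declares no input or output arguments because echo reads and writes no files.

```shell
# Sketch: write a minimal QScript file. The import path below is an
# assumption and varies by Queue release.
cat > HelloWorld.scala <<'EOF'
import org.broadinstitute.gatk.queue.QScript

class HelloWorld extends QScript {
  def script() {
    // Create a new CommandLineFunction and hand it to Queue with add()
    // for dispatch and monitoring. A real pipeline step would also
    // declare its files with @Input and @Output.
    add(new CommandLineFunction {
      def commandLine = "echo hello world"
    })
  }
}
EOF

# Dry run first, then append -run to execute (Queue.jar path is a placeholder):
#   java -jar Queue.jar -S HelloWorld.scala
```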

See the main article Queue QScripts for more info on QScripts.

Supported QScripts

Most QScripts are analysis pipelines that are custom-built for specific projects, and we currently do not offer any QScripts as supported analysis tools. However, we do provide some example scripts that you can use as basis to write your own QScripts (see below).

Example QScripts

The latest versions of the example files are available in the Sting GitHub repository under public/scala/qscript/examples


5. Visualization and Queue

QJobReport

Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.

Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally, the script requires the gsalib R package to be installed on the machine, which is typically done by adding its path to your .Rprofile file:

bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile
.libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")

Note that gsalib is available from the CRAN repository so you can install it with the canonical R package install command.
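If you take the .Rprofile route, a small shell helper can add the library path without duplicating it on repeated runs. This is only a sketch: add_r_libpath is a hypothetical function, and /path/to/gatk/public/R is a placeholder for wherever gsalib lives on your machine. The demo writes to a temporary file rather than your real ~/.Rprofile.

```shell
# Sketch: append a .libPaths() line to an R profile unless it is already there.
# add_r_libpath is a hypothetical helper; both arguments are placeholders.
add_r_libpath() {
  profile="$1"; libdir="$2"
  line=".libPaths(\"$libdir\")"
  touch "$profile"
  # grep flags: -F fixed string, -x match the whole line, -q quiet.
  if ! grep -Fxq "$line" "$profile"; then
    printf '%s\n' "$line" >> "$profile"
  fi
}

demo_profile=$(mktemp)
add_r_libpath "$demo_profile" "/path/to/gatk/public/R"
add_r_libpath "$demo_profile" "/path/to/gatk/public/R"   # second call changes nothing
cat "$demo_profile"
```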

Caveats

  • The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the previously completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves.

  • This feature only works for the command-line and LSF execution models. SGE should be easy to add for a motivated individual, but we cannot test this capability here at the Broad. Please send us a patch if you do extend Queue to support SGE.

DOT visualization of Pipelines

Queue emits a queue.dot file to help visualize your commands. You can open this file in programs such as DOT, OmniGraffle, etc. to view your pipelines. By default, each node is labeled with its full LSF command line, which can be too much in a complex pipeline.
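One way to view the file is to render it with the Graphviz dot command. The sketch below assumes Graphviz is installed; render_pipeline is a hypothetical wrapper, and the tiny two-node graph is a stand-in for a real queue.dot.

```shell
# Sketch: render a dot file to SVG with Graphviz, if dot is available.
# render_pipeline is a hypothetical helper, not part of Queue.
render_pipeline() {
  src="$1"; dst="$2"
  if ! command -v dot >/dev/null 2>&1; then
    # Without Graphviz there is nothing to render; report it and carry on.
    echo "Graphviz 'dot' not found; skipping render" >&2
    return 0
  fi
  dot -Tsvg "$src" -o "$dst"
}

# Demo input: a two-node graph standing in for a real queue.dot.
demo_dir=$(mktemp -d)
printf 'digraph pipeline {\n  "realign" -> "call";\n}\n' > "$demo_dir/queue.dot"
render_pipeline "$demo_dir/queue.dot" "$demo_dir/queue.svg"
```

For a real pipeline, point the first argument at the queue.dot that Queue wrote.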

To clarify your pipeline, override the dotString() function:

class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
    @Input(doc="foo") var bam = bamIn
    @Input(doc="foo") var bamIndex = bai(bamIn)
    @Output(doc="foo") var recalData = recalDataIn
    memoryLimit = Some(4)
    // Short label used for this node in queue.dot instead of the full command line.
    override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
    def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}

With this override, the dot file shows only a short label such as CountCovariates: my.bam [args -OQ]. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:

6. Further reading

Geraldine Van der Auwera, PhD
