Queue pipeline scripts (QScripts)

Geraldine_VdAuwera (Administrator, GATK Developer)
edited February 3 in Queue

1. Introduction

Queue pipelines, called QScripts, are Scala 2.8 files with a bit of syntactic sugar.

QScripts are easiest to develop using an Integrated Development Environment. See Queue with IntelliJ IDEA for our recommended settings.

The following is a basic outline of a QScript:

import org.broadinstitute.sting.queue.QScript
// List other imports here

// Define the overall QScript here.
class MyScript extends QScript {
  // List script arguments here.
  @Input(doc="My QScript inputs")
  var scriptInput: File = _

  // Create and add the functions in the script here.
  def script = {
    var myCL = new MyCommandLine
    myCL.myInput = scriptInput // Example variable input
    myCL.myOutput = new File("/path/to/output") // Example hardcoded output
    add(myCL)
  }

}

2. Imports

Imports can be any Scala or Java imports, written in Scala syntax.

import java.io.File
import scala.util.Random
import org.favorite.my._
// etc.

3. Classes

  • To add a CommandLineFunction to a pipeline, a class must be defined that extends QScript.

  • The QScript must define a method script.

  • The QScript can define helper methods or variables.

4. Script method

The body of script should create and add Queue CommandLineFunctions.

class MyScript extends org.broadinstitute.sting.queue.QScript {
  def script = add(new CommandLineFunction { def commandLine = "echo hello world" })
}

5. Command Line Arguments

  • A QScript can be set to read command line arguments by defining variables with @Input, @Output, or @Argument annotations.

  • A command line argument can be a primitive scalar, enum, File, or Scala immutable Array, List, Set, or Option of a primitive, enum, or File.

  • QScript command line arguments can be marked as optional by setting required=false.

    class MyScript extends org.broadinstitute.sting.queue.QScript {
      @Input(doc="example message to echo")
      var message: String = _
      def script = add(new CommandLineFunction { def commandLine = "echo " + message })
    }

6. Using and writing CommandLineFunctions

Adding existing GATK walkers

See Pipelining the GATK using Queue for more information on the automatically generated Queue wrappers for GATK walkers.

After functions are defined they should be added to the QScript pipeline using add().

for (vcf <- vcfs) {
  val ve = new VariantEval
  ve.vcfFile = vcf
  ve.evalFile = swapExt(vcf, "vcf", "eval")
  add(ve)
}
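
The loop above relies on swapExt to derive each output name from its input. As a rough, Queue-free illustration of that renaming step, the sketch below assumes swapExt simply strips the old extension from the file name and appends the new one. SwapExtSketch and its swapExt are hypothetical stand-ins written for this example, not Queue's own code, and this version keeps only the file name (any parent directory is dropped).

```scala
import java.io.File

object SwapExtSketch {
  // Hypothetical stand-in for the swapExt helper used above (an assumption,
  // not Queue's implementation): remove the old extension from the file
  // name, if present, and append the new one.
  def swapExt(file: File, oldExtension: String, newExtension: String): File =
    new File(file.getName.stripSuffix(oldExtension) + newExtension)

  def main(args: Array[String]): Unit = {
    val vcfs = List(new File("sampleA.vcf"), new File("sampleB.vcf"))
    // Pair every input VCF with the eval file name derived from it.
    for (vcf <- vcfs) println(vcf + " -> " + swapExt(vcf, "vcf", "eval"))
  }
}
```

Under that assumption, sampleA.vcf maps to sampleA.eval, so each VariantEval in the loop gets its own output file.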

Defining new CommandLineFunctions

  • Queue tracks dependencies between functions via variables annotated with @Input and @Output.

  • Queue runs functions according to the dependencies between them, not the order in which they are added in the script! So if the @Input of CommandLineFunction A depends on the @Output of CommandLineFunction B, A will wait for B to finish before it starts running.

  • See the main article Queue CommandLineFunctions for more information.
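
To make the dependency rule above concrete, here is a toy model in plain Scala. It is only a sketch of the idea, not how Queue builds its graph: Fn, runOrder, and the file names are all hypothetical, and files are represented as plain strings rather than @Input/@Output fields.

```scala
object DependencyOrderSketch {
  // Toy stand-in for a function: a name plus the files it reads and writes.
  case class Fn(name: String, inputs: Set[String], outputs: Set[String])

  // Repeatedly pick the functions whose inputs are all available, i.e. either
  // written by no function at all (pre-existing files) or already written by
  // a function that has finished.
  def runOrder(fns: List[Fn]): List[String] = {
    val produced = fns.flatMap(_.outputs).toSet

    def loop(pending: List[Fn], done: Set[String], order: List[String]): List[String] = {
      val (ready, blocked) =
        pending.partition(f => f.inputs.forall(i => !produced(i) || done(i)))
      if (ready.isEmpty) {
        // Nothing can start: either everything has run or there is a cycle.
        require(blocked.isEmpty, "cyclic dependency between functions")
        order
      } else {
        loop(blocked, done ++ ready.flatMap(_.outputs), order ++ ready.map(_.name))
      }
    }

    loop(fns, Set.empty, Nil)
  }

  // A is added first but reads the file that B writes, so B must run first.
  val a = Fn("A", inputs = Set("calls.vcf"), outputs = Set("filtered.vcf"))
  val b = Fn("B", inputs = Set("sample.bam"), outputs = Set("calls.vcf"))
  val order = runOrder(List(a, b)) // List("B", "A")
}
```

Even though A is listed before B, the computed order puts B first, which mirrors the rule that the order of add() calls does not determine the order of execution.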

7. Examples

  • The latest versions of the example files are available in the Sting git repository under public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/.

  • To print the list of arguments required by an existing QScript, run with -help.

  • To check if your script has all of the CommandLineFunction variables set correctly, run without -run.
  • When you are ready to execute the full pipeline, add -run.

Hello World QScript

The following is a "hello world" example that runs a single command line to echo hello world.

import org.broadinstitute.sting.queue.QScript

class HelloWorld extends QScript {
  def script = {
    add(new CommandLineFunction {
      def commandLine = "echo hello world"
    })
  }
}

The above file is checked into the Sting git repository under HelloWorld.scala. After building Queue from source, the QScript can be run with the following command:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run

It should produce output similar to:

INFO  16:23:27,825 QScriptManager - Compiling 1 QScript 
INFO 16:23:31,289 QScriptManager - Compilation complete
INFO 16:23:34,631 HelpFormatter - ---------------------------------------------------------
INFO 16:23:34,631 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO 16:23:34,632 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run
INFO 16:23:34,632 HelpFormatter - Date/Time: 2011/01/14 16:23:34
INFO 16:23:34,632 HelpFormatter - ---------------------------------------------------------
INFO 16:23:34,632 HelpFormatter - ---------------------------------------------------------
INFO 16:23:34,634 QCommandLine - Scripting HelloWorld
INFO 16:23:34,651 QCommandLine - Added 1 functions
INFO 16:23:34,651 QGraph - Generating graph.
INFO 16:23:34,660 QGraph - Running jobs.
INFO 16:23:34,689 ShellJobRunner - Starting: echo hello world
INFO 16:23:34,689 ShellJobRunner - Output written to /Users/kshakir/src/Sting/Q-43031@bmef8-d8e-1.out
INFO 16:23:34,771 ShellJobRunner - Done: echo hello world
INFO 16:23:34,773 QGraph - Deleting intermediate files.
INFO 16:23:34,773 QCommandLine - Done

ExampleUnifiedGenotyper.scala

This example uses automatically generated Queue compatible wrappers for the GATK. See Pipelining the GATK using Queue for more info on authoring Queue support into walkers and using walkers in Queue.

The ExampleUnifiedGenotyper.scala for running the UnifiedGenotyper followed by VariantFiltration can be found in the examples folder.

To list the command line parameters, including the required parameters, run with -help.

java -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -help

The help output should appear similar to this:

INFO  10:26:08,491 QScriptManager - Compiling 1 QScript
INFO 10:26:11,926 QScriptManager - Compilation complete
---------------------------------------------------------
Program Name: org.broadinstitute.sting.queue.QCommandLine
---------------------------------------------------------
---------------------------------------------------------
usage: java -jar Queue.jar -S <script> [-run] [-jobRunner <job_runner>] [-bsub] [-status] [-retry <retry_failed>]
[-startFromScratch] [-keepIntermediates] [-statusTo <status_email_to>] [-statusFrom <status_email_from>] [-dot
<dot_graph>] [-expandedDot <expanded_dot_graph>] [-jobPrefix <job_name_prefix>] [-jobProject <job_project>] [-jobQueue
<job_queue>] [-jobPriority <job_priority>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
<temp_directory>] [-jobSGDir <job_scatter_gather_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>]
[-emailTLS] [-emailSSL] [-emailUser <emailUsername>] [-emailPassFile <emailPasswordFile>] [-emailPass <emailPassword>]
[-l <logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h] -R <referencefile> -I <bamfile> [-L <intervals>]
[-filter <filternames>] [-filterExpression <filterexpressions>]

-S,--script <script>
    QScript scala file
-run,--run_scripts
    Run QScripts. Without this flag set only performs a dry run.
-jobRunner,--job_runner <job_runner>
    Use the specified job runner to dispatch command line jobs
-bsub,--bsub
    Equivalent to -jobRunner Lsf706
-status,--status
    Get status of jobs for the qscript
-retry,--retry_failed <retry_failed>
    Retry the specified number of times after a command fails. Defaults to no retries.
-startFromScratch,--start_from_scratch
    Runs all command line functions even if the outputs were previously output successfully.
-keepIntermediates,--keep_intermediate_outputs
    After a successful run keep the outputs of any Function marked as intermediate.
-statusTo,--status_email_to <status_email_to>
    Email address to send emails to upon completion or on error.
-statusFrom,--status_email_from <status_email_from>
    Email address to send emails from upon completion or on error.
-dot,--dot_graph <dot_graph>
    Outputs the queue graph to a .dot file. See: http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph>
    Outputs the queue graph of scatter gather to a .dot file. Otherwise overwrites the dot_graph
-jobPrefix,--job_name_prefix <job_name_prefix>
    Default name prefix for compute farm jobs.
-jobProject,--job_project <job_project>
    Default project for compute farm jobs.
-jobQueue,--job_queue <job_queue>
    Default queue for compute farm jobs.
-jobPriority,--job_priority <job_priority>
    Default priority for jobs.
-memLimit,--default_memory_limit <default_memory_limit>
    Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory>
    Root directory to run functions from.
-tempDir,--temp_directory <temp_directory>
    Temp directory to pass to functions.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>
    Default directory to place scatter gather output for compute farm jobs.
-emailHost,--emailSmtpHost <emailSmtpHost>
    Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort>
    Email SMTP port. Defaults to 465 for ssl, otherwise 25.
-emailTLS,--emailUseTLS
    Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL
    Email should use SSL. Defaults to false.
-emailUser,--emailUsername <emailUsername>
    Email SMTP username. Defaults to none.
-emailPassFile,--emailPasswordFile <emailPasswordFile>
    Email SMTP password file. Defaults to none.
-emailPass,--emailPassword <emailPassword>
    Email SMTP password. Defaults to none. Not secure! See emailPassFile.
-l,--logging_level <logging_level>
    Set the minimum level of logging, i.e. setting INFO get's you INFO up to FATAL, setting ERROR gets you ERROR and FATAL level logging.
-log,--log_to_file <log_to_file>
    Set the logging location
-quiet,--quiet_output_mode
    Set the logging to quiet mode, no output to stdout
-debug,--debug_mode
    Set the logging file string to include a lot of debugging information (SLOW!)
-h,--help
    Generate this help message

Arguments for ExampleUnifiedGenotyper:
-R,--referencefile <referencefile>
    The reference file for the bam files.
-I,--bamfile <bamfile>
    Bam file to genotype.
-L,--intervals <intervals>
    An optional file with a list of intervals to proccess.
-filter,--filternames <filternames>
    A optional list of filter names.
-filterExpression,--filterexpressions <filterexpressions>
    An optional list of filter expressions.


##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.commandline.MissingArgumentException:
Argument with name '--bamfile' (-I) is missing.
Argument with name '--referencefile' (-R) is missing.
at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:192)
at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:172)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:199)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:57)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 1.0.5504):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Argument with name '--bamfile' (-I) is missing.
##### ERROR Argument with name '--referencefile' (-R) is missing.
##### ERROR ------------------------------------------------------------------------------------------

To dry run the pipeline:

java \
-Djava.io.tmpdir=tmp \
-jar dist/Queue.jar \
-S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala \
-R human_b36_both.fasta \
-I pilot2_daughters.chr20.10k-11k.bam \
-L chr20.interval_list \
-filter StrandBias -filterExpression "SB>=0.10" \
-filter AlleleBalance -filterExpression "AB>=0.75" \
-filter QualByDepth -filterExpression "QD<5" \
-filter HomopolymerRun -filterExpression "HRun>=4"

The dry run output should appear similar to this:

INFO  10:45:00,354 QScriptManager - Compiling 1 QScript
INFO 10:45:04,855 QScriptManager - Compilation complete
INFO 10:45:05,058 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,059 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO 10:45:05,059 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -R human_b36_both.fasta -I pilot2_daughters.chr20.10k-11k.bam -L chr20.interval_list -filter StrandBias -filterExpression SB>=0.10 -filter AlleleBalance -filterExpression AB>=0.75 -filter QualByDepth -filterExpression QD<5 -filter HomopolymerRun -filterExpression HRun>=4
INFO 10:45:05,059 HelpFormatter - Date/Time: 2011/03/24 10:45:05
INFO 10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,061 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO 10:45:05,150 QCommandLine - Added 4 functions
INFO 10:45:05,150 QGraph - Generating graph.
INFO 10:45:05,169 QGraph - Generating scatter gather jobs.
INFO 10:45:05,182 QGraph - Removing original jobs.
INFO 10:45:05,183 QGraph - Adding scatter gather jobs.
INFO 10:45:05,231 QGraph - Regenerating graph.
INFO 10:45:05,247 QGraph - -------
INFO 10:45:05,252 QGraph - Pending: IntervalScatterFunction /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals
INFO 10:45:05,253 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/scatter/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,254 QGraph - -------
INFO 10:45:05,279 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,279 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,279 QGraph - -------
INFO 10:45:05,283 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,283 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,283 QGraph - -------
INFO 10:45:05,287 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,287 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,288 QGraph - -------
INFO 10:45:05,288 QGraph - Pending: SimpleTextGatherFunction /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,288 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-jobOutputFile/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,289 QGraph - -------
INFO 10:45:05,291 QGraph - Pending: java -Xmx1g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T CombineVariants -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:input0,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input1,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input2,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -priority input0,input1,input2 -assumeIdenticalSamples
INFO 10:45:05,291 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-out/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,292 QGraph - -------
INFO 10:45:05,296 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.eval
INFO 10:45:05,296 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-2.out
INFO 10:45:05,296 QGraph - -------
INFO 10:45:05,299 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantFiltration -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:vcf,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -filter SB>=0.10 -filter AB>=0.75 -filter QD<5 -filter HRun>=4 -filterName StrandBias -filterName AlleleBalance -filterName QualByDepth -filterName HomopolymerRun
INFO 10:45:05,299 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-3.out
INFO 10:45:05,302 QGraph - -------
INFO 10:45:05,303 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.eval
INFO 10:45:05,303 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-4.out
INFO 10:45:05,304 QGraph - Dry run completed successfully!
INFO 10:45:05,304 QGraph - Re-run with "-run" to execute the functions.
INFO 10:45:05,304 QCommandLine - Done

8. Using traits to pass common values from QScripts to CommandLineFunctions

QScript files often create multiple CommandLineFunctions with similar arguments. Use Scala features such as inner classes and traits (mixins) to reuse variables.

  • A self-type alias is useful for telling apart the different meanings of this. We use qscript as an alias for the QScript's this, to distinguish it from the this inside inner classes or traits.

  • A trait mixin can be used to reuse functionality. The trait below is designed to copy values from the QScript and is then mixed into different instances of the functions.

See the following example:

class MyScript extends org.broadinstitute.sting.queue.QScript {
  // Create an alias 'qscript' for 'MyScript.this'
  qscript =>

  // This is a script argument
  @Argument(doc="message to display")
  var message: String = _

  // This is a script argument
  @Argument(doc="number of times to display")
  var count: Int = _

  trait ReusableArguments extends MyCommandLineFunction {
    // Whenever a function is created 'with' this trait, it will copy the message.
    this.commandLineMessage = qscript.message
  }

  abstract class MyCommandLineFunction extends CommandLineFunction {
    // This is a per command line argument
    @Argument(doc="message to display")
    var commandLineMessage: String = _
  }

  class MyEchoFunction extends MyCommandLineFunction {
    def commandLine = "echo " + commandLineMessage
  }

  class MyAlsoEchoFunction extends MyCommandLineFunction {
    def commandLine = "echo also " + commandLineMessage
  }

  def script = {
    for (i <- 1 to count) {
      val echo = new MyEchoFunction with ReusableArguments
      val alsoEcho = new MyAlsoEchoFunction with ReusableArguments
      add(echo, alsoEcho)
    }
  }
}
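
The two Scala mechanisms used in this example, the self-type alias and the trait body that runs during construction, can be tried without Queue. The following is a minimal sketch with hypothetical names (Script, Command, CopiesMessage, Echo), standing in for the QScript, MyCommandLineFunction, ReusableArguments, and MyEchoFunction above; it makes no claim about Queue's own classes.

```scala
object SelfAliasSketch {
  class Script(val message: String) {
    // Alias 'outer' for 'Script.this', playing the role of 'qscript' above.
    outer =>

    abstract class Command {
      var text: String = ""
      def commandLine: String
    }

    // The trait body runs while the instance is being constructed, after the
    // Command and Echo initializers, so every Command created 'with
    // CopiesMessage' ends up holding the script's message.
    trait CopiesMessage extends Command {
      this.text = outer.message
    }

    class Echo extends Command {
      def commandLine = "echo " + text
    }

    // Mix the trait in at instantiation time, as the script method does above.
    def build: Command = new Echo with CopiesMessage
  }
}
```

For example, new SelfAliasSketch.Script("hi").build.commandLine evaluates to "echo hi": inside the trait, this refers to the command being built while outer refers to the enclosing script.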