If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Redirecting tmp files to scratch disk with Queue

Our cluster is setup in such a way that each compute node has a fairly generous scratch disk associated with it. This means that it would be really nice to be able to write temporary files to those scratch disks instead of having to write them over the network to the globally accessible disk. I've been experimenting with different ways of trying to do this using Queue, but so far I've come up short.

Since Queue tries to create all temporary directories (or at least check for their existence) before the jobs are run. I would like to know if there is any way that I could redirect the temporary outputs to the scratch disk using environment variables (To achieve something like this in the end: mycommand -tmp $TMPDIR ...). Since Queue basically just sends commands to be executed on the compute nodes I don't understand why it's necessary to check all temporary dirs before hand. I'm sure I'm missing something here, and I'd like to know what.

From my understanding of Queue I could see two types of temporary files that need to be written to globally accessible disk. The first being the compiled QScript, which location I guess could be set by changing the following piece of code in org.broadinstitute.sting.queue.QCommandLine from:

  private lazy val qScriptPluginManager = {
    qScriptClasses = IOUtils.tempDir("Q-Classes-", "", settings.qSettings.tempDirectory)
    qScriptManager.loadScripts(scripts, qScriptClasses)
    new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))


  private lazy val qScriptPluginManager = {
    qScriptClasses = IOUtils.tempDir("Q-Classes-", "", "/path/to/globally/accessible/disk")
    qScriptManager.loadScripts(scripts, qScriptClasses)
    new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))

Of course not hard coded, but feed into the the QCommandLine as an argument, but this is just to illustrate the point.

And the other one would be scatter-gather jobs, which can be explicitly redirected using: --job_scatter_gather_directory, to make sure that all of those end up in a globally accessible part of the file system.

All other temporary files should be possible to write to local non-globally-accessible disk? Or am I completely wrong here, and this is not a path worth pursuing further?



  • pdexheimerpdexheimer Member ✭✭✭✭

    What other temporary files are you thinking of? In my mind, the largest burden comes from the scattered commands.

    Any other temporary files I can think of would be application-specific. Assuming that all (or most) of your jobs are java, I would think you could manipulate the directive. Some googling turned up two possible environment variables (_JAVA_OPTIONS and JAVA_TOOL_OPTIONS) that might let you control every java process launched on the system. I couldn't tell which of those would be more appropriate in a quick look, but one or the other would probably work

  • kshakirkshakir Broadie, Dev ✭✭

    Hi Johan,

    Do you have an example script we could look at, along with expected command lines to be generated?

    Phil's _JAVA_OPTIONS and JAVA_TOOL_OPTIONS may just work too, but I haven't used them myself. As a quick-and-dirty alternative, I attached a hacked ExampleUnifiedGenotyper. The script patches the existing local temporary directory for the UG using a scala mixin.

    That said, a proper fix may involve adding more hooks/variables, such that Queue may be aware how to handle local directories. Because Queue doesn't know anything about /local/scratch in the attached script, among other issues, it won't try and create the local directories.

    I'd be happy to discuss your usages, or review github patches if you have suggestions and test cases.

  • Johan_DahlbergJohan_Dahlberg Member ✭✭✭

    @pdexheimer‌, I'm think e.g. of temporary files generated by MarkDuplicates (for the sorting collections).

    Thanks, @kshakir‌, that looks a lot like what I'm looking for. I've been playing around with this now, and I think I might have a solution, at least for the Picard functions, which was my primary concern. I have a patch for it. Would you prefer that I email the patch to you, or that I submit it as a pull request on github? (I am not allowed to attach it to my post here, so I'm copy pasting it here for you guys to see what I've been thinking of)

    From 1523eb6bd3ed826fdd137bd61a2a3fd7f0a416ae Mon Sep 17 00:00:00 2001
    From: Johan Dahlberg <[email protected]>
    Date: Wed, 9 Apr 2014 14:57:39 +0200
    Subject: [PATCH] Allowing writing tmp files to local scratch
    By overriding the localScratch function with something that does not return None, it is possible to reroute the TMP_DIR output of Picard bam functions (sorting collections and such) to a local scratch disk.
     .../extensions/picard/PicardBamFunction.scala      |    7 +++++--
     1 file changed, 5 insertions(+), 2 deletions(-)
    diff --git a/public/queue-framework/src/main/scala/org/broadinstitute/sting/queue/extensions/picard/PicardBamFunction.scala b/public/queue-framework/src/main/scala/org/broadinstitute/sting/queue/extensions/picard/PicardBamFunction.scala
    index 8fb2147..12d7853 100644
    --- a/public/queue-framework/src/main/scala/org/broadinstitute/sting/queue/extensions/picard/PicardBamFunction.scala
    +++ b/public/queue-framework/src/main/scala/org/broadinstitute/sting/queue/extensions/picard/PicardBamFunction.scala
    @@ -45,13 +45,16 @@ trait PicardBamFunction extends JavaCommandLineFunction {
       var createMD5: Option[Boolean] = None
       var maxRecordsInRam: Option[Int] = None
       var assumeSorted: Option[Boolean] = None
    +  def localScratch: Option[File] = None 
    +  def tmpDir = localScratch.getOrElse(jobTempDir)
       protected def inputBams: Seq[File]
       protected def outputBam: File
       abstract override def commandLine = super.commandLine +
         repeat("INPUT=", inputBams, spaceSeparated=false) +
    -    required("TMP_DIR=" + jobTempDir) +
    +    required("TMP_DIR=" + tmpDir, escape=false) +
         optional("OUTPUT=", outputBam, spaceSeparated=false) +
         optional("COMPRESSION_LEVEL=", compressionLevel, spaceSeparated=false) +
         optional("VALIDATION_STRINGENCY=", validationStringency, spaceSeparated=false) +

    Once I had that in place I was able to do something like this:

      trait LocalScratch extends PicardBamFunction {
        // Do some check to see if local scratch exists
        val localScratchEnvVariable = System.getenv("LOCAL_SCRATCH")
        override def localScratch: Option[File] =
          if (localScratchEnvVariable != null)
            Some(new File(localScratchEnvVariable))
       * Wraps Picard MarkDuplicates
      case class dedup(inBam: File,
                       outBam: File,
                       metricsFile: File,
                       asIntermediate: Boolean = true)
          extends MarkDuplicates with TwoCoreJob with LocalScratch {

    To override the location of the TMP_DIR, I'm testing this now, and I'll report the results back to you. I'm hoping to see some increase in performance, but we'll see if I'm right about that.

  • Johan_DahlbergJohan_Dahlberg Member ✭✭✭

    And I might add that if you are interested in bringing this patch in to the Queue code, I can of course make sure to write some documentation for the functions, but I just wanted to get your opinion before moving on (as well as finish testing it out in my on my data first).

  • pdexheimerpdexheimer Member ✭✭✭✭

    I like this approach, but wonder if it would be worth trying to apply it more broadly. I haven't looked at the relevant Queue code in a while, but perhaps localScratchDir could be defined in JavaCommandLineFunction or even one of it's ancestors. I'm still a little stuck on the idea of defining (some GATK walkers like this), but it would also permit other application-specific definitions

  • Johan_DahlbergJohan_Dahlberg Member ✭✭✭

    @pdexheimer‌, applying this approach higher up in the class hierarchy sounds like a good idea. I just wanted to get something testable in a isolated context as a first step. I can look into how this could be done.

  • Johan_DahlbergJohan_Dahlberg Member ✭✭✭

    Looking around more in the code I found something interesting in QFunction:

       * Local path available on all machines to store LOCAL temporary files. Not an @Input,
       * nor an @Output. Currently only used for local intermediate files for composite jobs.
       * Needs to be an annotated field so that it's mutated during cloning.
      @Argument(doc="Local path available on all machines to store LOCAL temporary files.")
      var jobLocalDir: File = _

    This looks to me a lot like what I've been looking for - maybe @kshakir‌ can confirm that this? Right now this only seems to be used by the ScatterGatherableFunction, but is there any reason this could not also be used by JavaCommandLineFunction, PicardBamFunction, etc to reroute their temporary outputs to scratch?

Sign In or Register to comment.