Redirecting tmp files to scratch disk with Queue

Johan_Dahlberg Posts: 85, Member

Our cluster is set up in such a way that each compute node has a fairly generous scratch disk associated with it. This means that it would be really nice to be able to write temporary files to those scratch disks instead of having to write them over the network to the globally accessible disk. I've been experimenting with different ways of trying to do this using Queue, but so far I've come up short.

Queue tries to create all temporary directories (or at least check for their existence) before the jobs are run, so I would like to know whether there is any way to redirect the temporary outputs to the scratch disk using environment variables (to achieve something like this in the end: mycommand -tmp $TMPDIR ...). Since Queue basically just sends commands to be executed on the compute nodes, I don't understand why it's necessary to check all temporary dirs beforehand. I'm sure I'm missing something here, and I'd like to know what.
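
A minimal sketch of why deferring the expansion matters, assuming the generated command is eventually run by a shell on the compute node. The names `DeferredTmpDir`, `shellQuote`, and `command` are hypothetical and only for illustration; they are not part of Queue. If the argument is shell-quoted, `$TMPDIR` reaches the program as a literal string; if it is left unquoted, the shell on the node can expand it to that node's scratch path.

```scala
object DeferredTmpDir {
  // Single-quote a value so a POSIX shell treats it literally
  // (this blocks $VAR expansion on the compute node).
  def shellQuote(s: String): String = "'" + s.replace("'", "'\\''") + "'"

  // Build the illustrative command line from the post. With escape = false
  // the variable reference is left for the node's shell to expand.
  def command(tmp: String, escape: Boolean): String =
    "mycommand -tmp " + (if (escape) shellQuote(tmp) else tmp)

  def main(args: Array[String]): Unit = {
    println(command("$TMPDIR", escape = true))
    println(command("$TMPDIR", escape = false))
  }
}
```

Leaving a value unquoted is only safe when the string is trusted, since the shell will also interpret any other special characters in it.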

From my understanding of Queue, I can see two types of temporary files that need to be written to globally accessible disk. The first is the compiled QScript, whose location I guess could be set by changing the following piece of code in org.broadinstitute.sting.queue.QCommandLine from:

  private lazy val qScriptPluginManager = {
    qScriptClasses = IOUtils.tempDir("Q-Classes-", "", settings.qSettings.tempDirectory)
    qScriptManager.loadScripts(scripts, qScriptClasses)
    new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))
  }

to:

  private lazy val qScriptPluginManager = {
    qScriptClasses = IOUtils.tempDir("Q-Classes-", "", "/path/to/globally/accessible/disk")
    qScriptManager.loadScripts(scripts, qScriptClasses)
    new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))
  }

Of course it would not be hard-coded but fed into QCommandLine as an argument; this is just to illustrate the point.

The other would be scatter-gather jobs, which can be explicitly redirected using --job_scatter_gather_directory to make sure that all of their files end up in a globally accessible part of the file system.

Shouldn't it be possible to write all other temporary files to local, non-globally-accessible disk? Or am I completely wrong here, and this is not a path worth pursuing further?

Answers

  • pdexheimer Posts: 387, Member, GSA Collaborator

    What other temporary files are you thinking of? In my mind, the largest burden comes from the scattered commands.

    Any other temporary files I can think of would be application-specific. Assuming that all (or most) of your jobs are Java, I would think you could manipulate the -Djava.io.tmpdir directive. Some googling turned up two possible environment variables (_JAVA_OPTIONS and JAVA_TOOL_OPTIONS) that might let you control every Java process launched on the system. I couldn't tell from a quick look which of those would be more appropriate, but one or the other would probably work.
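
A small sketch for checking what a JVM actually uses as its temporary directory, which may help when experimenting with these variables. `TmpDirCheck` is a hypothetical helper, not part of Queue or GATK. The assumption is that the process is launched with something like `JAVA_TOOL_OPTIONS=-Djava.io.tmpdir=/local/scratch` in its environment, where `/local/scratch` is a placeholder path.

```scala
import java.io.File

object TmpDirCheck {
  // Directory the JVM uses for temporary files. Passing -Djava.io.tmpdir=...
  // at launch (directly or via an options environment variable) changes it.
  def jvmTmpDir: File = new File(System.getProperty("java.io.tmpdir"))

  def main(args: Array[String]): Unit = {
    println(s"java.io.tmpdir = $jvmTmpDir")
    // createTempFile without an explicit directory uses the default temp dir.
    val probe = File.createTempFile("probe-", ".tmp")
    println(s"temp file created at ${probe.getAbsolutePath}")
    probe.delete()
  }
}
```

Running this on a compute node, once with and once without the variable set, shows whether the setting is reaching the JVMs that the jobs launch.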

  • kshakir Posts: 22, GATK Developer, mod

    Hi Johan,

    Do you have an example script we could look at, along with expected command lines to be generated?

    Phil's _JAVA_OPTIONS and JAVA_TOOL_OPTIONS may just work too, but I haven't used them myself. As a quick-and-dirty alternative, I attached a hacked ExampleUnifiedGenotyper. The script patches the existing local temporary directory for the UG using a scala mixin.

    That said, a proper fix may involve adding more hooks/variables, such that Queue may be aware how to handle local directories. Because Queue doesn't know anything about /local/scratch in the attached script, among other issues, it won't try and create the local directories.

    I'd be happy to discuss your usages, or review github patches if you have suggestions and test cases.

    Attachment: ExampleUnifiedGenotyperLocalTempDir.scala.txt (1K)
  • Johan_Dahlberg Posts: 85, Member

    @pdexheimer, I'm thinking of, for example, the temporary files generated by MarkDuplicates (for the sorting collections).

    Thanks, @kshakir‌, that looks a lot like what I'm looking for. I've been playing around with this now, and I think I might have a solution, at least for the Picard functions, which was my primary concern. I have a patch for it. Would you prefer that I email the patch to you, or that I submit it as a pull request on github? (I am not allowed to attach it to my post here, so I'm copy pasting it here for you guys to see what I've been thinking of)

    From 1523eb6bd3ed826fdd137bd61a2a3fd7f0a416ae Mon Sep 17 00:00:00 2001
    From: Johan Dahlberg <johan.dahlberg@medsci.uu.se>
    Date: Wed, 9 Apr 2014 14:57:39 +0200
    Subject: [PATCH] Allowing writing tmp files to local scratch
    
    By overriding the localScratch function with something that does not return None, it is possible to reroute the TMP_DIR output of Picard bam functions (sorting collections and such) to a local scratch disk.
    ---
     .../extensions/picard/PicardBamFunction.scala      |    7 +++++--
     1 file changed, 5 insertions(+), 2 deletions(-)
    
    diff --git a/public/queue-framework/src/main/scala/org/broadinstitute/sting/queue/extensions/picard/PicardBamFunction.scala b/public/queue-framework/src/main/scala/org/broadinstitute/sting/queue/extensions/picard/PicardBamFunction.scala
    index 8fb2147..12d7853 100644
    --- a/public/queue-framework/src/main/scala/org/broadinstitute/sting/queue/extensions/picard/PicardBamFunction.scala
    +++ b/public/queue-framework/src/main/scala/org/broadinstitute/sting/queue/extensions/picard/PicardBamFunction.scala
    @@ -45,13 +45,16 @@ trait PicardBamFunction extends JavaCommandLineFunction {
       var createMD5: Option[Boolean] = None
       var maxRecordsInRam: Option[Int] = None
       var assumeSorted: Option[Boolean] = None
    -
    +  
    +  def localScratch: Option[File] = None 
    +  def tmpDir = localScratch.getOrElse(jobTempDir)
    +  
       protected def inputBams: Seq[File]
       protected def outputBam: File
    
       abstract override def commandLine = super.commandLine +
         repeat("INPUT=", inputBams, spaceSeparated=false) +
    -    required("TMP_DIR=" + jobTempDir) +
    +    required("TMP_DIR=" + tmpDir, escape=false) +
         optional("OUTPUT=", outputBam, spaceSeparated=false) +
         optional("COMPRESSION_LEVEL=", compressionLevel, spaceSeparated=false) +
         optional("VALIDATION_STRINGENCY=", validationStringency, spaceSeparated=false) +
    -- 
    1.7.9.5
    

    Once I had that in place I was able to do something like this:

      trait LocalScratch extends PicardBamFunction {
        // Do some check to see if local scratch exists
        val localScratchEnvVariable = System.getenv("LOCAL_SCRATCH")
    
        override def localScratch: Option[File] =
          if (localScratchEnvVariable != null)
            Some(new File(localScratchEnvVariable))
          else
            None
      }
    
      /**
       * Wraps Picard MarkDuplicates
       */
      case class dedup(inBam: File,
                       outBam: File,
                       metricsFile: File,
                       asIntermediate: Boolean = true)
          extends MarkDuplicates with TwoCoreJob with LocalScratch {
    

    This overrides the location of TMP_DIR. I'm testing this now, and I'll report the results back to you. I'm hoping to see some increase in performance, but we'll see if I'm right about that.
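
The fallback logic in the patch can be tried out in isolation with the sketch below. It uses stand-in traits rather than the real Queue classes: `TmpDirFunction` and `EnvScratch` are hypothetical names, and the `env` lookup method is an assumption added so the behaviour can be exercised without changing the real environment.

```scala
import java.io.File

// Stand-in for the patched PicardBamFunction; not the real Queue trait.
trait TmpDirFunction {
  def jobTempDir: File
  // No local scratch by default, so tmpDir falls back to jobTempDir.
  def localScratch: Option[File] = None
  def tmpDir: File = localScratch.getOrElse(jobTempDir)
  def commandLine: String = "TMP_DIR=" + tmpDir
}

// Mixin that takes the scratch location from LOCAL_SCRATCH when it is set.
trait EnvScratch extends TmpDirFunction {
  // Overridable lookup (an assumption of this sketch, to ease testing).
  def env(name: String): Option[String] = sys.env.get(name)
  override def localScratch: Option[File] =
    env("LOCAL_SCRATCH").map(new File(_))
}

object ScratchDemo {
  def main(args: Array[String]): Unit = {
    val shared = new TmpDirFunction { val jobTempDir = new File("/shared/tmp") }
    val local = new EnvScratch {
      val jobTempDir = new File("/shared/tmp")
      override def env(name: String): Option[String] = Some("/local/scratch")
    }
    println(shared.commandLine) // uses the job temp dir
    println(local.commandLine)  // uses the scratch dir
  }
}
```

Note that in this sketch, as in the trait above, the environment variable is read in the JVM that builds the command line, not on the node that later runs the job.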

  • Johan_Dahlberg Posts: 85, Member

    And I might add that if you are interested in bringing this patch into the Queue code, I can of course write some documentation for the functions. I just wanted to get your opinion before moving on (and to finish testing it on my data first).

  • pdexheimer Posts: 387, Member, GSA Collaborator

    I like this approach, but wonder if it would be worth trying to apply it more broadly. I haven't looked at the relevant Queue code in a while, but perhaps localScratchDir could be defined in JavaCommandLineFunction or even one of its ancestors. I'm still a little stuck on the idea of defining java.io.tmpdir (some GATK walkers like this), but it would also permit other application-specific definitions.

  • Johan_Dahlberg Posts: 85, Member

    @pdexheimer, applying this approach higher up in the class hierarchy sounds like a good idea. I just wanted to get something testable in an isolated context as a first step. I can look into how this could be done.

  • Johan_Dahlberg Posts: 85, Member

    Looking around more in the code I found something interesting in QFunction:

      /**
       * Local path available on all machines to store LOCAL temporary files. Not an @Input,
       * nor an @Output. Currently only used for local intermediate files for composite jobs.
       * Needs to be an annotated field so that it's mutated during cloning.
       */
      @Argument(doc="Local path available on all machines to store LOCAL temporary files.")
      var jobLocalDir: File = _
    

    This looks to me a lot like what I've been looking for. Maybe @kshakir can confirm this? Right now it only seems to be used by ScatterGatherableFunction, but is there any reason it could not also be used by JavaCommandLineFunction, PicardBamFunction, etc. to reroute their temporary outputs to scratch?
