
# Redirecting tmp files to scratch disk with Queue

Member Posts: 96 ✭✭✭

Our cluster is set up in such a way that each compute node has a fairly generous scratch disk associated with it. This means that it would be really nice to be able to write temporary files to those scratch disks instead of having to write them over the network to the globally accessible disk. I've been experimenting with different ways of doing this using Queue, but so far I've come up short.

Queue tries to create all temporary directories (or at least check that they exist) before the jobs are run, so I would like to know whether there is any way to redirect the temporary outputs to the scratch disk using environment variables, to end up with something like `mycommand -tmp $TMPDIR ...`. Since Queue basically just sends commands to be executed on the compute nodes, I don't understand why it's necessary to check all temporary directories beforehand. I'm sure I'm missing something here, and I'd like to know what.

From my understanding of Queue, I can see two types of temporary files that need to be written to globally accessible disk. The first is the compiled QScript, whose location I guess could be set by changing the following piece of code in org.broadinstitute.sting.queue.QCommandLine from:

```scala
private lazy val qScriptPluginManager = {
  qScriptClasses = IOUtils.tempDir("Q-Classes-", "", settings.qSettings.tempDirectory)
  new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))
}
```


to:

```scala
private lazy val qScriptPluginManager = {
  qScriptClasses = IOUtils.tempDir("Q-Classes-", "", "/path/to/globally/accessible/disk")
  new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))
}
```


Not hard-coded, of course, but fed into QCommandLine as an argument; this is just to illustrate the point.

The second would be scatter-gather jobs, which can be explicitly redirected using --job_scatter_gather_directory to make sure that all of their files end up in a globally accessible part of the file system.

It should then be possible to write all other temporary files to local, non-globally-accessible disk, shouldn't it? Or am I completely wrong here, and this is not a path worth pursuing further?


• Member, Dev Posts: 544 ✭✭✭✭

What other temporary files are you thinking of? In my mind, the largest burden comes from the scattered commands.

Any other temporary files I can think of would be application-specific. Assuming that all (or most) of your jobs are Java, I would think you could manipulate the -Djava.io.tmpdir directive. Some googling turned up two possible environment variables (_JAVA_OPTIONS and JAVA_TOOL_OPTIONS) that might let you control every Java process launched on the system. I couldn't tell from a quick look which of those would be more appropriate, but one or the other would probably work.
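As a rough sketch of the idea (not Queue's actual API), the helper below builds a Java command line in which `java.io.tmpdir` is left as an unexpanded shell variable, so that a shell on the compute node, not the submitting host, resolves it. The names `TmpDirCommand` and `javaCommand`, and the default variable name `TMPDIR`, are assumptions made for illustration only.

```scala
object TmpDirCommand {
  /**
   * Builds a java command line whose temp directory is a shell variable
   * reference (e.g. $TMPDIR). The reference is emitted literally, so it is
   * expanded only when a shell on the compute node runs the command.
   */
  def javaCommand(jar: String, args: Seq[String], tmpDirVar: String = "TMPDIR"): String = {
    // Plain concatenation keeps the dollar sign literal in the output.
    val tmpOpt = "-Djava.io.tmpdir=$" + tmpDirVar
    (Seq("java", tmpOpt, "-jar", jar) ++ args).mkString(" ")
  }
}
```

With a hypothetical `tool.jar`, `TmpDirCommand.javaCommand("tool.jar", Seq("INPUT=in.bam"))` returns `java -Djava.io.tmpdir=$TMPDIR -jar tool.jar INPUT=in.bam`. This only helps if the command is eventually run through a shell and nothing escapes the `$` along the way.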

• Dev Posts: 29 ✭✭

Hi Johan,

Do you have an example script we could look at, along with expected command lines to be generated?

Phil's _JAVA_OPTIONS and JAVA_TOOL_OPTIONS may just work too, but I haven't used them myself. As a quick-and-dirty alternative, I attached a hacked ExampleUnifiedGenotyper. The script patches the existing local temporary directory for the UG using a Scala mixin.

That said, a proper fix may involve adding more hooks/variables so that Queue knows how to handle local directories. Among other issues, because Queue doesn't know anything about /local/scratch in the attached script, it won't try to create the local directories.

I'd be happy to discuss your usages, or review github patches if you have suggestions and test cases.

• Member Posts: 96 ✭✭✭

@pdexheimer, I'm thinking of, e.g., the temporary files generated by MarkDuplicates (for the sorting collections).

Thanks, @kshakir, that looks a lot like what I'm looking for. I've been playing around with this now, and I think I might have a solution, at least for the Picard functions, which were my primary concern. I have a patch for it. Would you prefer that I email the patch to you, or that I submit it as a pull request on GitHub? (I am not allowed to attach it to my post here, so I'm copy-pasting it here for you guys to see what I've been thinking of.)

```diff
From 1523eb6bd3ed826fdd137bd61a2a3fd7f0a416ae Mon Sep 17 00:00:00 2001
From: Johan Dahlberg <johan.dahlberg@medsci.uu.se>
Date: Wed, 9 Apr 2014 14:57:39 +0200
Subject: [PATCH] Allowing writing tmp files to local scratch

By overriding the localScratch function with something that does not return None, it is possible to reroute the TMP_DIR output of Picard bam functions (sorting collections and such) to a local scratch disk.
---
 .../extensions/picard/PicardBamFunction.scala      |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

index 8fb2147..12d7853 100644
@@ -45,13 +45,16 @@ trait PicardBamFunction extends JavaCommandLineFunction {
   var createMD5: Option[Boolean] = None
   var maxRecordsInRam: Option[Int] = None
   var assumeSorted: Option[Boolean] = None
-
+
+  def localScratch: Option[File] = None
+  def tmpDir = localScratch.getOrElse(jobTempDir)
+
   protected def inputBams: Seq[File]
   protected def outputBam: File

   abstract override def commandLine = super.commandLine +
     repeat("INPUT=", inputBams, spaceSeparated=false) +
-    required("TMP_DIR=" + jobTempDir) +
+    required("TMP_DIR=" + tmpDir, escape=false) +
     optional("OUTPUT=", outputBam, spaceSeparated=false) +
     optional("COMPRESSION_LEVEL=", compressionLevel, spaceSeparated=false) +
     optional("VALIDATION_STRINGENCY=", validationStringency, spaceSeparated=false) +
--
1.7.9.5
```


Once I had that in place I was able to do something like this:

```scala
trait LocalScratch extends PicardBamFunction {
  // Do some check to see if local scratch exists
  val localScratchEnvVariable = System.getenv("LOCAL_SCRATCH")

  override def localScratch: Option[File] =
    if (localScratchEnvVariable != null)
      Some(new File(localScratchEnvVariable))
    else
      None
}

/**
 * Wraps Picard MarkDuplicates
 */
case class dedup(inBam: File,
                 outBam: File,
                 metricsFile: File,
                 asIntermediate: Boolean = true)
  extends MarkDuplicates with TwoCoreJob with LocalScratch {
```


This overrides the location of TMP_DIR. I'm testing it now, and I'll report the results back to you. I'm hoping to see some increase in performance, but we'll see if I'm right about that.
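The override pattern can be tried outside Queue with stand-in types. The sketch below is not Queue code: `TmpDirFunction` only mimics the temp-dir part of PicardBamFunction, and `EnvLocalScratch`, `lookupEnv` and `DedupJob` are hypothetical names introduced for illustration. One thing it makes visible is that the environment variable is read by the JVM that runs the script, so the approach assumes the scratch path is the same on the submitting host and on the compute nodes.

```scala
import java.io.File

// Stand-in for Queue's PicardBamFunction: only the temp-dir logic is modelled.
trait TmpDirFunction {
  def jobTempDir: File                     // globally accessible default
  def localScratch: Option[File] = None    // no local scratch unless overridden
  def tmpDir: File = localScratch.getOrElse(jobTempDir)
  def commandLine: String = "TMP_DIR=" + tmpDir.getPath
}

// Mixin that takes the scratch location from an environment variable.
trait EnvLocalScratch extends TmpDirFunction {
  // Hypothetical hook: overridable so the lookup can be replaced in tests.
  protected def lookupEnv(name: String): Option[String] = sys.env.get(name)

  // Unset or empty variables fall back to jobTempDir via tmpDir.
  override def localScratch: Option[File] =
    lookupEnv("LOCAL_SCRATCH").filter(_.nonEmpty).map(new File(_))
}

// Hypothetical job class used only to exercise the traits.
class DedupJob(val jobTempDir: File) extends TmpDirFunction
```

A job created as `new DedupJob(sharedDir) with EnvLocalScratch` uses the scratch directory when `LOCAL_SCRATCH` is set and non-empty, and `sharedDir` otherwise.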

• Member Posts: 96 ✭✭✭

And I might add that if you are interested in bringing this patch into the Queue code, I can of course make sure to write some documentation for the functions, but I just wanted to get your opinion before moving on (as well as finish testing it out on my data first).

• Member, Dev Posts: 544 ✭✭✭✭

I like this approach, but wonder if it would be worth trying to apply it more broadly. I haven't looked at the relevant Queue code in a while, but perhaps localScratchDir could be defined in JavaCommandLineFunction or even one of its ancestors. I'm still a little stuck on the idea of defining java.io.tmpdir (some GATK walkers make use of it), but defining it higher up would also permit other application-specific definitions.

• Member Posts: 96 ✭✭✭

@pdexheimer, applying this approach higher up in the class hierarchy sounds like a good idea. I just wanted to get something testable in an isolated context as a first step. I can look into how this could be done.

• Member Posts: 96 ✭✭✭

Looking around more in the code I found something interesting in QFunction:

```scala
/**
 * Local path available on all machines to store LOCAL temporary files. Not an @Input,
 * nor an @Output. Currently only used for local intermediate files for composite jobs.
 * Needs to be an annotated field so that it's mutated during cloning.
 */
@Argument(doc="Local path available on all machines to store LOCAL temporary files.")
var jobLocalDir: File = _
```


This looks to me a lot like what I've been looking for; maybe @kshakir can confirm this? Right now it only seems to be used by ScatterGatherableFunction, but is there any reason it could not also be used by JavaCommandLineFunction, PicardBamFunction, etc. to reroute their temporary outputs to scratch?
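To make the question concrete, here is a minimal sketch of what reusing such a field higher up in the hierarchy might look like. It uses stand-in traits rather than the real Queue classes: `FunctionSketch`, `JavaSketch`, `PicardSketch` and the helper `scratchDir` are hypothetical names, and whether Queue's real classes could be wired this way is exactly the open question.

```scala
import java.io.File

// Stand-in for QFunction; the two field names mirror the ones quoted above.
trait FunctionSketch {
  var jobTempDir: File = new File(System.getProperty("java.io.tmpdir"))
  var jobLocalDir: File = null             // stays null until configured

  /** Node-local directory if one is set, otherwise the job temp directory. */
  def scratchDir: File = Option(jobLocalDir).getOrElse(jobTempDir)
}

// Stand-in for JavaCommandLineFunction: JVM temp files follow scratchDir.
trait JavaSketch extends FunctionSketch {
  def javaOpts: String = "-Djava.io.tmpdir=" + scratchDir.getPath
}

// Stand-in for PicardBamFunction: Picard's TMP_DIR follows the same directory.
trait PicardSketch extends JavaSketch {
  def picardArgs: String = "TMP_DIR=" + scratchDir.getPath
}
```

Defining the fallback once in the base trait means both the JVM option and the Picard argument change together when `jobLocalDir` is set, which is roughly the broader application suggested earlier in the thread.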