
Job names in Queue using DRMAA job runner

Johan_Dahlberg Posts: 36 Member

I've been running Queue using the DRMAA job runner, and I've noticed one thing which I would like to bring up for discussion. The job names are currently generated using the following code:

  // Set the display name to < 512 characters of the description
  // NOTE: Not sure if this is configuration specific?
  protected val jobNameLength = 500
  protected val jobNameFilter = """[^A-Za-z0-9_]"""
  protected def functionNativeSpec = function.jobNativeArgs.mkString(" ")

  def start() {
    session.synchronized {
      val drmaaJob: JobTemplate = session.createJobTemplate

      drmaaJob.setJobName(function.description.take(jobNameLength).replaceAll(jobNameFilter, "_"))
  [...]

For me this yields names looking something like this:

"__java_____Xmx3072m_____D"

This is not very useful for telling the jobs apart. I'm running my jobs via DRMAA on a system using the SLURM resource manager, so the cut-off in the name above can be attributed to SLURM cutting off the name. Even so, I think there should be more reasonable ways to create the name, using function.jobName for example.

So, this leads me to my question: is there any particular reason that the job names are generated the way they are? And if not, do you (the GATK team) want a patch changing this to use function.jobName instead?
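To make the difference concrete, here is a minimal, self-contained sketch of the naming expression quoted above, applied first to a command-line style description and then to a short job name. The example strings and the object name `JobNameSketch` are made up for illustration; they are not taken from Queue, and the actual value of function.jobName may look different.

```scala
object JobNameSketch {
  // Same constants as in the snippet quoted above.
  val jobNameLength = 500
  val jobNameFilter = """[^A-Za-z0-9_]"""

  // Truncate first, then replace every character outside [A-Za-z0-9_] with "_".
  def sanitize(raw: String): String =
    raw.take(jobNameLength).replaceAll(jobNameFilter, "_")

  def main(args: Array[String]): Unit = {
    // Hypothetical description: the full command line of the job.
    val description = "java -Xmx3072m -jar Queue.jar"
    // Hypothetical short name standing in for function.jobName.
    val jobName = "AlignReads-3"

    println(sanitize(description)) // java__Xmx3072m__jar_Queue_jar
    println(sanitize(jobName))     // AlignReads_3
  }
}
```

If the scheduler then truncates the name further, the command-line variant loses whatever distinguishes one job from another, while a short name is more likely to survive intact.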

Furthermore, I would be interested in hearing from other users running GATK Queue over DRMAA, since I think it might be interesting to develop this further. As an example, I have had to implement setting a hard wall time in the jobRunner, since the cluster I'm running on demands this. I'm sure that there are more solutions like that out there, and I would be thrilled to hear about them.


Comments

  • kshakir Posts: 3 GSA Official Member mod

    The job names as command lines are less and less important for display purposes when checking job status with programs like qstat and bjobs.

    The biggest example is that we even considered using job arrays for submission. At that point if the jobs are all identified in the farm as something like 'queue_job[102]' then the current command line and even .jobName would be impossible to display. Instead Queue could log both the 'queue_job[102]' submission name and the full command line. That would at least allow one to figure out which job is which by referencing the Queue logs.

    That said, to answer your question about why the generated names are the way they are: we started Queue as a wrapper around LSF. When you submit a job to LSF, bsub takes the command line and makes that the job name, up to ~4000 characters. When Queue began development, emulating this behavior was very helpful for debugging. As we added GridEngine support we were also able to verify that the ~1000 character job names were what we expected.

    In the meantime I personally often use bsub output to quickly figure out what jobs are running or sometimes suspended in our LSF cluster. I expect that one day, with job arrays, I will have to take the extra step of going back to the Queue logs or even providing a utility program. Until submitted job names are completely useless, though, we've just been leaving the short truncated names the way that they are in GridEngine/DRMAA. But if you have a patch I'd be more than happy to take a look.

    Re: the hard wall time-- if you have a patch that adds a jobRunLimit to QFunction/QSettings/JobRunners that's a feature present in most farms that we could include as an option for the QScript authors and Queue users. I doubt we would use the functionality but perhaps others will.

    Re: SLURM-- while I can't test the jobRunner like we do with GridEngine and LSF, if you had an extension of the DRMAA JobRunner with specific QFunction/CommandLineFunction mappings that reduced the amount of Queue command line options one had to include I'd be happy to review that patch as well. It could end up benefiting other users who would like to use Queue with SLURM.

    Thanks!
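As a rough illustration of the logging idea in the reply above (record the array-style submission name next to the full command line, then look it up later), here is a small sketch. Nothing in it is Queue code: the `Submission` record, the tab-separated log format and the example commands are all assumptions made for the example.

```scala
object ArrayJobLogSketch {
  // Hypothetical record tying an array-style submission name to its command line.
  final case class Submission(arrayName: String, index: Int, commandLine: String) {
    def submissionName: String = s"$arrayName[$index]"
  }

  // One log line per job: submission name, a tab, then the full command line.
  def logLine(s: Submission): String = s"${s.submissionName}\t${s.commandLine}"

  // Recover the command line for a submission name from previously written log lines.
  def lookup(logLines: Seq[String], submissionName: String): Option[String] =
    logLines.iterator
      .map(_.split("\t", 2))
      .collectFirst { case Array(name, cmd) if name == submissionName => cmd }

  def main(args: Array[String]): Unit = {
    val log = Seq(
      Submission("queue_job", 101, "java -Xmx3072m -jar A.jar"),
      Submission("queue_job", 102, "java -Xmx3072m -jar B.jar")
    ).map(logLine)

    println(lookup(log, "queue_job[102]")) // Some(java -Xmx3072m -jar B.jar)
  }
}
```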

  • TimHughes Posts: 15 Member

    A colleague of mine has been running Queue using DRMAA on Condor but had to make some changes to the code to make this possible. In the future we will be transitioning to SLURM and I would be very interested in the changes Johan has made. In particular, I have heard from the person responsible for the cluster (using SLURM) that the hard wall time could be an issue for him.

    I was also wondering, are you using GATK 1.x or 2.x and will queue remain open source in 2.x?

  • Johan_Dahlberg Posts: 36 Member

    I wrote the starting post of this discussion, read kshakir's answer, got other priorities and then the whole thing sort of slipped my mind.

    On the subject of the job names, I have written a patch (however only for the DrmaaJobRunner), but I haven't had time to test it properly yet. I will get back to you once I have.

    Concerning the QFunction/CommandLineFunction mappings, I will make sure to collect my changes into a patch there as well. Since I have done this by looking at the existing code and trying to replicate it, I'm sure there is some stuff in there that is not fully up to standards, so any help reviewing that would be much appreciated.

    @TimHughes I would be happy to share my experience of running Queue with SLURM/DRMAA, and of course share any code I have. If you want to dive straight in you can check out my gatk fork at https://github.com/johandahlberg/gatk/tree/devel. Note that you need to look at the devel branch, since the master branch is just my copy of the main gatk repo.

    I'm not sure if the last question was aimed at me, or the gatk team. But assuming that you were wondering what I've been using, I'm using GATK Lite 2.x (if I've got the terminology right), the open source version of the current source code. Furthermore if I've understood things correctly queue will stay open source in the future...

  • Mark_DePristo Posts: 132 Administrator, GSA Official Member admin

    I'd be very much interested in incorporating patches to Queue for any other job execution engines, so please do contribute. We intend -- like with the GATK -- that the framework itself will remain open source, so that anyone can use it to run their own scripts, but that potentially some (to be fair, this is currently none) scripts would be premium tools put into the full release only.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • Johan_Dahlberg Posts: 36 Member

    Now I've gotten around to formatting my patch for the job walltime in the DRMAA job runner. I tried to attach it to the post, but the format wasn't allowed, so I'm pasting it below. I can of course also send it by email if anyone is interested. Any comments are very welcome; there may well be better ways to achieve this end, and if so I would be happy to hear about them.

    From c696ecf2d36b524e1842d67f54c67961546967aa Mon Sep 17 00:00:00 2001
    From: Johan Dahlberg <johan.dahlberg@medsci.uu.se>
    Date: Fri, 28 Sep 2012 14:56:08 +0200
    Subject: [PATCH] Setting the walltime in the Drmaa jobrunner
    
    ---
     .../org/broadinstitute/sting/queue/QSettings.scala |    4 ++++
     .../sting/queue/engine/drmaa/DrmaaJobRunner.scala  |    3 +++
     .../sting/queue/function/CommandLineFunction.scala |   10 ++++++++++
     3 files changed, 17 insertions(+)
    
    diff --git a/public/scala/src/org/broadinstitute/sting/queue/QSettings.scala b/public/scala/src/org/broadinstitute/sting/queue/QSettings.scala
    index 1a50301..bae3bde 100644
    --- a/public/scala/src/org/broadinstitute/sting/queue/QSettings.scala
    +++ b/public/scala/src/org/broadinstitute/sting/queue/QSettings.scala
    @@ -31,6 +31,10 @@ import org.broadinstitute.sting.commandline.Argument
      * Default settings settable on the command line and passed to CommandLineFunctions.
      */
     class QSettings {
    +  
    +  @Argument(fullName="job_walltime", shortName="wallTime", doc="Setting the required walltime when using the drmaa job runner.", required=false)
    +  var jobWalltime: Option[Long] = None
    +  
       @Argument(fullName="run_name", shortName="runName", doc="A name for this run used for various status messages.", required=false)
       var runName: String = _
    
    diff --git a/public/scala/src/org/broadinstitute/sting/queue/engine/drmaa/DrmaaJobRunner.scala b/public/scala/src/org/broadinstitute/sting/queue/engine/drmaa/DrmaaJobRunner.scala
    index 2aae2fc..31b314c 100644
    --- a/public/scala/src/org/broadinstitute/sting/queue/engine/drmaa/DrmaaJobRunner.scala
    +++ b/public/scala/src/org/broadinstitute/sting/queue/engine/drmaa/DrmaaJobRunner.scala
    @@ -65,6 +65,9 @@ class DrmaaJobRunner(val session: Session, val function: CommandLineFunction) ex
             drmaaJob.setJoinFiles(true)
           }
    
    +      if(function.wallTime.isDefined)
    +         drmaaJob.setHardWallclockTimeLimit(function.wallTime.get)
    +      
           drmaaJob.setNativeSpecification(functionNativeSpec)
    
           // Instead of running the function.commandLine, run "sh <jobScript>"
    diff --git a/public/scala/src/org/broadinstitute/sting/queue/function/CommandLineFunction.scala b/public/scala/src/org/broadinstitute/sting/queue/function/CommandLineFunction.scala
    index 84b6257..66e51b3 100644
    --- a/public/scala/src/org/broadinstitute/sting/queue/function/CommandLineFunction.scala
    +++ b/public/scala/src/org/broadinstitute/sting/queue/function/CommandLineFunction.scala
    @@ -32,6 +32,9 @@ import org.broadinstitute.sting.queue.util._
     trait CommandLineFunction extends QFunction with Logging {
       def commandLine: String
    
    +  /** Setting the wall time request for drmaa job*/
    +  var wallTime: Option[Long] = None
    +  
       /** Upper memory limit */
       var memoryLimit: Option[Double] = None
    
    @@ -63,6 +66,9 @@ trait CommandLineFunction extends QFunction with Logging {
         super.copySettingsTo(function)
         function match {
           case commandLineFunction: CommandLineFunction =>
    +        if(commandLineFunction.wallTime.isEmpty)
    +          commandLineFunction.wallTime = this.wallTime
    +        
             if (commandLineFunction.memoryLimit.isEmpty)
               commandLineFunction.memoryLimit = this.memoryLimit
    
    @@ -106,6 +112,10 @@ trait CommandLineFunction extends QFunction with Logging {
        * Sets all field values.
        */
       override def freezeFieldValues() {
    +   
    +    if(wallTime.isEmpty)
    +      wallTime = qSettings.jobWalltime
    +    
         if (jobQueue == null)
           jobQueue = qSettings.jobQueue
    
    -- 
    1.7.9.5
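
The fallback order in the patch above (a wall time set on the function wins, otherwise the global job_walltime setting is used, otherwise no limit is set) can be sketched on its own as follows. The stub classes are hypothetical stand-ins for QSettings, CommandLineFunction and the DRMAA JobTemplate, and the limit is assumed to be in seconds. Since an Option is never null in idiomatic Scala, the sketch tests for emptiness via `orElse` and `foreach` rather than comparing against null.

```scala
object WallTimeSketch {
  // Hypothetical stand-ins; only the wall time fields are modelled.
  final case class SettingsStub(jobWalltime: Option[Long] = None)
  final case class FunctionStub(wallTime: Option[Long] = None)

  // Hypothetical stand-in for a job template that records the limit it was given.
  final class JobTemplateStub {
    var hardWallclockTimeLimit: Option[Long] = None
    def setHardWallclockTimeLimit(seconds: Long): Unit = {
      hardWallclockTimeLimit = Some(seconds)
    }
  }

  // The function-level value wins; otherwise fall back to the global setting.
  def effectiveWallTime(function: FunctionStub, settings: SettingsStub): Option[Long] =
    function.wallTime.orElse(settings.jobWalltime)

  // Only touch the job template when a wall time is actually defined.
  def applyWallTime(job: JobTemplateStub, wallTime: Option[Long]): Unit =
    wallTime.foreach(seconds => job.setHardWallclockTimeLimit(seconds))

  def main(args: Array[String]): Unit = {
    val job = new JobTemplateStub
    val wallTime = effectiveWallTime(FunctionStub(), SettingsStub(Some(3600L)))
    applyWallTime(job, wallTime)
    println(job.hardWallclockTimeLimit) // Some(3600)
  }
}
```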
    
  • Geraldine_VdAuwera Posts: 2,238 Administrator, GSA Official Member admin

    Hi Johan, thanks for sharing this! We'll have a look and see if we can add this to the codebase. To that end, could you please check that your patch conforms to the patch submission instructions then email it to me at vdauwera@broadinstitute.org? Thanks!

    Geraldine Van der Auwera, PhD

  • Johan_Dahlberg Posts: 36 Member

    I think that it does follow the guidelines, but if it does not, please tell me and I will try to fix it. I have emailed you the patch now.

  • Geraldine_VdAuwera Posts: 2,238 Administrator, GSA Official Member admin

    Hi Johan, I'm glad to report that we've finally got around to integrating your walltime patch into the codebase! It will be available in the next release (2.3). Thanks for your contribution!

    Geraldine Van der Auwera, PhD
