We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Drmaa does not report job completion status

brisk022brisk022 University of ZurichMember ✭✭✭

Greetings,

I have a strange problem running Queue 3.3-0 with Open Grid Scheduler 6.2. The job is submitted and executed without any issues but the running thread is never notified about its completion. In particular, running HelloWorld script without any special parameters, e.g. java -Djava.io.tmpdir=tmp -jar $rlib/Queue.jar -S HelloWorld.scala -jobRunner GridEngine -run, results in the following error.

INFO 21:56:30,826 FunctionEdge - Starting: echo hello world INFO 21:56:30,826 FunctionEdge - Output written to /home/rbrisk/tmp/jobrunner/HelloWorld-1.out INFO 21:56:30,852 DrmaaJobRunner - Submitted job id: 16736 INFO 21:56:30,854 QGraph - 0 Pend, 1 Run, 0 Fail, 0 Done WARN 21:57:30,828 DrmaaJobRunner - Unable to determine status of job id 16736 org.ggf.drmaa.InvalidJobException: The job specified by the 'jobid' does not exist. at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:442) at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392) at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.getJobProgramStatus(JnaSession.java:156) at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:105) at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner.updateJobStatus(DrmaaJobRunner.scala:104) at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:56) at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:56) at scala.collection.immutable.Set$Set1.foreach(Set.scala:74) at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobManager.updateStatus(DrmaaJobManager.scala:56) at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1128) at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1120) at scala.collection.immutable.List.foreach(List.scala:318) at org.broadinstitute.gatk.queue.engine.QGraph.updateStatus(QGraph.scala:1120) at org.broadinstitute.gatk.queue.engine.QGraph.runJobs(QGraph.scala:468) at org.broadinstitute.gatk.queue.engine.QGraph.run(QGraph.scala:156) at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:171) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:62) at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
The job completes successfully as the content of HelloWorld-1.out is indeed "hello world" but its status is not reported back to Queue. Is there anything to try apart from digging into the source code?

Thanks!

Roman

Answers

  • ymcymc Member

    I don't know about your scheduler but I am a happy user of Sun Grid Engine 6.2u8

  • brisk022brisk022 University of ZurichMember ✭✭✭

    I can't figure out what GE update it is. It might be 6.2u1. The timestamps of the GE binaries show 2008-08-07.

    I took a session example from drmaa API documentation and ran a hello world script. I did get the successful completion status back in the end when using session's wait method. I also polled the job's status at 100ms intervals until the completion. The job went from QUEUED_ACTIVE to RUNNING to DONE. This makes me think that the problem might be somewhere in the Queue code. It misses the job completion and polls its status after the job has terminated. Either that or there is some problem with the session's synchronise method.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I'm afraid we can't help you with that particular problem. If you get to the bottom of this, let us know, but this is beyond the scope of support we can provide.

  • brisk022brisk022 University of ZurichMember ✭✭✭
    edited April 2015

    I figured it out. My system locale is set as LC_ALL=uk_UA.UTF8. When Queue's DrmaaJobRunner.scala requests the job's resource usage, org.ggf.drmaa.JobInfo obtains it from the GridEngine, which is locale-aware. It means that start time and end time are reported as a string that looks like 1429866332,0000. That is fine. But then DrmaaJobRunner.scala tries to convert it to Double using StringOps.toDouble, which is not locale-aware. (Besides, GATK's CommandLineProgram.java forces JVM locale to en_US from the start, so you cannot simply patch it by switching to NumberFormat.parse.) As a result, StringOps.toDouble throws NumberFormatException, which is subsequently swallowed in the catch block. Since the exception is thrown after the job has been reaped but before the job is removed from the running job list in QGraph.scala, the job information is no longer available in the GridEngine's Session object but QGraph.scala keeps asking for its status.

    The quick fix would be to run Queue as LC_ALL=C java -jar Queue.jar .... Could you please update the documentation regarding the LC_ALL settings? I did not find any recommendations about it on the forum.

    I also attach a patch for DrmaaJobRunner.scala that fixes the swallowed exception problem. This would greatly facilitate troubleshooting for people whose locale is not en_US or similar and who are not aware of the requirement.

    Issue · Github
    by Geraldine_VdAuwera

    Issue Number
    953
    State
    open
    Last Updated
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks @brisk022 , I'll see what we can do.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    edited April 2015

    Hi @brisk022, I consulted the devs, and @kshakir had this to say:

    I'd prefer if instead of externally setenv LC_ALL=C, he could find the way of internally setting this variable or setting, just like the CommandLineProgram sets the locale to en_US. That way hopefully there would be no more of these NumberFormatExceptions, and no need for the custom patch inside the DrmaaJobRunner. It may be as simple as calling LibC.setenv("LC_ALL", "C", 1) in the same place as setting the locale.
    While I appreciate his pain trying to debug this issue, I also think back up in the QGraph, we should maybe move all errors up to log level DEBUG, instead of silencing errors. That way if someone encounters something unexpected like this in the future, they may enable DEBUG and maybe get a bit more insight instead of needing to modify the code. I would also accept arguments that this sort of error should be log level WARN, but I'm not sure how prevalent hiccups are, and I'd hope most of the time they are automatically restarted.
    https://github.com/broadinstitute/gsa-unstable/blob/dd4ddcbea55f9903eff0ef3dc2f7e1b1dec456eb/public/gatk-queue/src/main/scala/org/broadinstitute/gatk/queue/engine/QGraph.scala#L1139
    It looks like back when I was putting in this updateStatus() code, it was around the time when LSF was returning random errors now and then, and maybe I was just frustrated?

    (also, he says nice job figuring out the issue!)

  • brisk022brisk022 University of ZurichMember ✭✭✭

    Thanks, Geraldine! I'm currently swamped with other tasks but I'll try it later.

Sign In or Register to comment.