We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Queue not making progress (after initial jobs released)

fromerfromer NYCMember ✭✭


I've started running Queue recently using the latest code but now using GridEngine.

SInce that time, I've noticed the following behavior: I start a Queue Script, which looks at the GridEngine queue I ask for and decides how many jobs to release and run, which it then does. But, it never seems to add more jobs to that (even after all of the original jobs are done). Even waiting a few hours to days the Queue script process is just waiting for something and never releasing more jobs.

Note that this is a big Queue job as the initial calculation takes 2 hours, but in the past it’s been the case that jobs get continuously released as old jobs finish.

Here’s an example timeline:

INFO 14:19:09,415 QCommandLine - Scripting xhmmCNVpipeline

INFO 14:20:29,551 QCommandLine - Added 8171 functions

INFO 14:20:29,553 QGraph - Generating graph.

INFO 14:25:37,447 QGraph - Generating scatter gather jobs.

INFO 14:26:54,393 QGraph - Removing original jobs.

INFO 14:26:55,190 QGraph - Adding scatter gather jobs.

INFO 14:43:05,534 QGraph - Regenerating graph.

INFO 16:22:01,645 QGraph - Running jobs.

INFO 16:24:51,242 QGraph - 23445 Pend, 374 Run, 0 Fail, 25137 Done

INFO 18:58:38,339 QGraph - 23445 Pend, 370 Run, 0 Fail, 25141 Done

… …

INFO 05:41:35,679 QGraph - 23445 Pend, 2 Run, 0 Fail, 25509 Done

INFO 05:42:27,330 QGraph - 23445 Pend, 0 Run, 0 Fail, 25511 Done

Hours later, I gave up and killed the java Queue process, since it was not releasing any new jobs. RIght before that, I ran jstack on that process and have copied it below.

Has anyone else seen this behavior?

Is there any way to nudge Queue to release more jobs into the queueing system? Either initially, along the way (while some jobs have finished and some are still running), or when it's just waiting with 0 jobs?

Thanks, Menachem

2014-12-15 10:59:13 Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode):

"Attach Listener" daemon prio=10 tid=0x0000000009a6f000 nid=0x2169 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"Service Thread" daemon prio=10 tid=0x00002b6714029000 nid=0x1e72 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00002b6714026800 nid=0x1e71 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00002b6714023800 nid=0x1e70 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00002b6714021800 nid=0x1e6f runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00002b6714003000 nid=0x1e6e in Object.wait() [0x00002b672e17d000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x0000000600011eb0> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135) - locked <0x0000000600011eb0> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

"Reference Handler" daemon prio=10 tid=0x000000000999c800 nid=0x1e6d in Object.wait() [0x00002b672e07c000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x000000060000e5c0> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:503) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133) - locked <0x000000060000e5c0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00000000095ea000 nid=0x1e54 runnable [0x00002b670cd63000] java.lang.Thread.State: RUNNABLE at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48) at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176) at scala.collection.immutable.List.$colon$colon$colon(List.scala:127) at scala.collection.immutable.List.$plus$plus(List.scala:193) at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$org$broadinstitute$gatk$queue$engine$QGraph$$previousFunctions$1.apply(QGraph.scala:230) at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$org$broadinstitute$gatk$queue$engine$QGraph$$previousFunctions$1.apply(QGraph.scala:223) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at org.broadinstitute.gatk.queue.engine.QGraph.org$broadinstitute$gatk$queue$engine$QGraph$$previousFunctions(QGraph.scala:223) at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$getReadyJobs$1.apply(QGraph.scala:285) at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$getReadyJobs$1.apply(QGraph.scala:283) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) at scala.collection.AbstractTraversable.filter(Traversable.scala:105) at org.broadinstitute.gatk.queue.engine.QGraph.getReadyJobs(QGraph.scala:283) at org.broadinstitute.gatk.queue.engine.QGraph.runJobs(QGraph.scala:508) at org.broadinstitute.gatk.queue.engine.QGraph.run(QGraph.scala:156) - locked <0x0000000600008190> (a java.lang.Object) at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:170) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61) at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)

"VM Thread" prio=10 tid=0x000000000999a800 nid=0x1e6c runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000009600000 nid=0x1e55 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000009602000 nid=0x1e56 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000009604000 nid=0x1e57 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000009606000 nid=0x1e58 runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000009607800 nid=0x1e59 runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000009609800 nid=0x1e5a runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x000000000960b800 nid=0x1e5b runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x000000000960d000 nid=0x1e5c runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x000000000960f000 nid=0x1e5d runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x0000000009611000 nid=0x1e5e runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x0000000009613000 nid=0x1e5f runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x0000000009614800 nid=0x1e60 runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x0000000009616800 nid=0x1e61 runnable

"GC task thread#13 (ParallelGC)" prio=10 tid=0x0000000009618800 nid=0x1e62 runnable

"GC task thread#14 (ParallelGC)" prio=10 tid=0x000000000961a000 nid=0x1e63 runnable

"GC task thread#15 (ParallelGC)" prio=10 tid=0x000000000961c000 nid=0x1e64 runnable

"GC task thread#16 (ParallelGC)" prio=10 tid=0x000000000961e000 nid=0x1e65 runnable

"GC task thread#17 (ParallelGC)" prio=10 tid=0x0000000009620000 nid=0x1e66 runnable

"GC task thread#18 (ParallelGC)" prio=10 tid=0x0000000009621800 nid=0x1e67 runnable

"GC task thread#19 (ParallelGC)" prio=10 tid=0x0000000009623800 nid=0x1e68 runnable

"GC task thread#20 (ParallelGC)" prio=10 tid=0x0000000009625800 nid=0x1e69 runnable

"GC task thread#21 (ParallelGC)" prio=10 tid=0x0000000009627000 nid=0x1e6a runnable

"GC task thread#22 (ParallelGC)" prio=10 tid=0x0000000009629000 nid=0x1e6b runnable

"VM Periodic Task Thread" prio=10 tid=0x00002b6714034000 nid=0x1e73 waiting on condition

JNI global references: 393


  • fromerfromer NYCMember ✭✭

    From using jstack, it seems like the script is iterating over the following lines in the runJobs() function in public/gatk-queue/src/main/scala/org/broadinstitute/gatk/queue/engine/QGraph.scala:

    readyJobs ++= getReadyJobs



    It's almost as if it's not getting the correct job statuses from Grid Engine perhaps since it keeps running these but never progresses?

    FYI, the stuck process ID (on gsa3) is 25664 if someone wants to take a look.


  • fromerfromer NYCMember ✭✭

    I've tried switching back to bsub (LSF) and this issue still seems to persist. So perhaps it's not a GridEngine issue after all?

    Has anyone experienced this before?
    Queue releases a first batch of jobs and knows when they're done, but never releases any more - either as the original jobs finish or even when they're all done.

  • pdexheimerpdexheimer Member ✭✭✭✭

    My Queue experience has entirely been in LSF, but I've never encountered this issue. Could it be that some of your prerequisite jobs are exiting with an abnormal status? That would lead to a different status being sent back to Queue (the difference between DONE and EXIT in LSF), and I believe that getReadyJobs considers jobs that EXITed to not satisfy the prerequisites. I don't know if Queue will recognize an "inaccessible" branch of the DAG and terminate, or if it will do what you're seeing and just sit idle forever

  • kshakirkshakir Broadie, Dev ✭✭✭

    For debugging purposes, can you run a version of this script with a smaller number of inputs, on the order of 5 or 50, instead of 50 thousand? It would help figure out what's going on at this point in the script.

    Based on the stack dump at that singular point in time, the code appears to be inside the recursive function QGraph.previousFunctions().

    However, the fact that the engine is winding down to 0 Run jobs makes it also appear that the Queue graph is not wedged in an infinite loop of recursive code. Queue's graph uses an event loop, and the program in general doesn't use too many background threads.

    I'd recommend trying one or more of the following, preferably with much smaller test case (50K jobs will produces an inappropriate amount of graph debug output):

    • Run Queue with -l DEBUG, without -run first, then with -run. In the debug output we're looking to make sure that the each next job you expect to run has a dependency on the previous output. Dependencies are visible via Inputs, Outputs, Done+ and Done- in the debug outputs.
    • Run Queue listening for a debugger, then after the process gets stuck again, attach IntelliJ IDEA to take a look at the running process. With IntelliJ it should be relatively easy to see on a small dataset why Queue thinks there are pending jobs but does not think they are ready to run.
    • Print out the job graph using -gv qgraph.dot and open it in a DOT compatible program like OmniGraffle. With a small job graph, we're looking to see if we can visualize some sort of unexpected sectioning in the graph, especially around the jobs that are pending but never start running.
  • fromerfromer NYCMember ✭✭

    OK, thanks for your detailed advice.

    I've run a few sets of toy examples that encompass fewer jobs (between 50 and 5000).
    Starting each of these from scratch (both in dry run mode and actually running them with -run), they each properly finish to completion making the natural progress of releasing jobs as previous ones are completed.

    I now realize that what in fact may be the problem is that, as a developer who has worked more extensively with Makefiles, I added some final commands to the Qscript and expected the large run of tens of thousands of jobs to continue from where it had left off. That is, I expected it build the latter (ie, downstream) parts of the graph and execute those jobs (since all the new jobs' dependent jobs had previously completed).

    However, it seems that this possibly was a bad assumption on my part.
    On this large job, the dry run properly lists all of the pending jobs.
    But, on the actual run, it only releases the next few hundred jobs and then gets stuck in some loop as we know.
    Is this behavior expected? Is there anything to do to possibly mitigate this?

    I should also say that I took the toy run after it finished and similarly added on new jobs that are dependent only on finished jobs. In this smaller case, it does seem to behaving as I had hoped: adding in the new jobs, running them, and then adding new jobs as the previous ones complete. In other words, it's not stalling for the toy example.

    I've also followed your suggestions above and output a .dot file for both a toy run and the large run.
    I've then used graphviz (http://www.graphviz.org/Documentation.php) to test that both graphs are acyclic (they are). Using graphviz, they each consist of 2 connected components (I would've expected 1 each, but that does not seem to be a differentiating factor between these runs).

    This is part of a large project (here at the Broad) with an upcoming deadline and it took at least a few weeks to run the first part of the Qscript, so running this again from scratch is not a realistic option right now.

    Short of writing a new Qscript that performs only the latter functions, is there any other obvious solution?

    Thanks for your help!

  • fromerfromer NYCMember ✭✭

    Since this issue was time-critical, the solution I've reached is to in fact write a new Qscript that performs only the latter functions.

    We can consider this issue closed for the time being.

    It would still be interesting to get to the bottom of the above issue, but I can see that this is out of the scope of standard Queue scripting which assumes that you define a Qscript and then run it. That is, without editing the Qscript later to run additional downstream jobs without starting from the beginning.

  • fromerfromer NYCMember ✭✭

    Thanks all for your comments, suggestions, and help.

Sign In or Register to comment.