Problems with GridEngine

jturner Posts: 10 Member
edited January 2013 in Ask the GATK team


I am trying to run the GATK variant detection pipeline on 112 stickleback samples, using a GridEngine queue to parallelize the work across our machines. I previously ran the same code on a subset of the samples (55) and it worked fine. However, when I tried to run it on the full 112, I ran into some strange errors. In particular, things like:

commlib returns can't find connection

WARN  13:58:57,655 DrmaaJobRunner - Unable to determine status of job id 4970049 
org.ggf.drmaa.DrmCommunicationException: failed receiving gdi request response for mid=19906 (can't find connection).
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.getJobProgramStatus(
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree2$1(DrmaaJobRunner.scala:101)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.updateJobStatus(DrmaaJobRunner.scala:100)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:55)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:55)
        at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:123)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobManager.updateStatus(DrmaaJobManager.scala:55)
        at org.broadinstitute.sting.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1076)
        at org.broadinstitute.sting.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1068)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
        at scala.collection.immutable.List.foreach(List.scala:45)
        at org.broadinstitute.sting.queue.engine.QGraph.updateStatus(QGraph.scala:1068)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:442)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:127)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)

crop up, followed by something like:

error: smallest event number 108 is greater than number 1 i'm waiting for

Does anyone have any idea of what might be going wrong? Either way, do you have any suggestions to help me move forward?

As a note, I have not tried running the 55 again, so it is possible that this would also now fail. In other words, I don't know whether the problem is due to some difference between the 55 and 112 sets, or if some part of the GATK that has been updated in the interim has introduced the problem. I can try running the original set again if it would be helpful.

Thanks, Jason

Post edited by Geraldine_VdAuwera on

Best Answer

  • Johan_Dahlberg Posts: 85 Member ✭✭✭
    Answer ✓

    This is some guesswork on my part, since I'm not familiar with the specifics of GridEngine. First of all, "commlib returns can't find connection" indicates that drmaa failed to communicate with GridEngine. The documentation for the exception being thrown, org.ggf.drmaa.DrmCommunicationException, points to the same thing. Are you sure that your cluster was not experiencing problems while you were running? That would be one option to explore.

    Another thing that might be causing the problem, judging from the other error message, is that drmaa is waiting for a return of 0 for successful jobs, and 1 for failed ones, but the job then returns with code 108. I found some stuff while googling around that indicated that code 108 might be a sign of running out of memory (at least on OSX systems). Does your script process all of the samples together in some step? If so, it might be that you run out of memory when running on the full set, but not on the subset.

    If the problem is with the cluster, see if you can get your sysadmin to figure it out. If it is a problem in the script and you post it here, I might be able to help you more.

    As I said, I'm no expert with GridEngine, so this is more of a pointer to where I would start looking than an actual answer. Hope that it helps anyway.
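
If the communication failure turns out to be transient, one generic way to tolerate it is to retry the status query with a growing delay. The following is only a minimal sketch of that idea in Python, not how Queue or drmaa actually behave: `TransientCommError`, `poll_with_retry`, and the `query_status` callback are hypothetical names introduced for illustration.

```python
import time


class TransientCommError(Exception):
    """Hypothetical stand-in for a communication failure
    (comparable in spirit to DrmCommunicationException)."""


def poll_with_retry(query_status, job_id, retries=5, base_delay=1.0,
                    sleep=time.sleep):
    """Call query_status(job_id), retrying on TransientCommError.

    The delay doubles after each failed attempt (exponential backoff).
    The last failure is re-raised once all retries are used up.
    """
    for attempt in range(retries):
        try:
            return query_status(job_id)
        except TransientCommError:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would pass in whatever function actually asks the scheduler for a job's status. Retrying only helps if the connection recovers; a persistent failure would still surface after the last attempt.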


  • jturner Posts: 10 Member

    Hi Johan,

    Thanks for your answer. I don't think that we've been having problems with the cluster as I was running, but I'll try running it again and see if I can see anything.

    As for the 108 being an error code, the event number printed is not constant. The first time I tried to run the code, for example, the event number was 7657. I think that what it is saying is that it has spawned some number of GridEngine jobs, but it thinks that some have finished without it noticing (i.e. it's still waiting for job 1, but the lowest numbered remaining job is 108).

    Thanks for the advice. The script is adapted from another, more elaborate script, so I'm hesitant to post it as-is since I think it's a bit opaque. I'll try running again and keeping an eye on the system, and if that fails I'll clean up the code a bit and post it.

    Thanks again, Jason
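
The interpretation above (a client expecting numbered events in order and noticing that the earliest ones are no longer available) can be illustrated with a toy model. This is a sketch of that reading only, under the assumption that events are numbered sequentially; `check_event_gap` is a hypothetical function and does not reflect GridEngine's actual implementation.

```python
def check_event_gap(waiting_for, pending_events):
    """Toy model: report a gap if the oldest pending event number
    is already past the one the client is still waiting for.

    Returns a message describing the gap, or None if there is none.
    """
    if not pending_events:
        return None
    smallest = min(pending_events)
    if smallest > waiting_for:
        return (f"smallest event number {smallest} is greater than "
                f"number {waiting_for} i'm waiting for")
    return None
```

Under this reading, the number in the message depends on how many events were missed before the check, which would be consistent with it changing from run to run (108 in one run, 7657 in another).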

  • Johan_Dahlberg Posts: 85 Member ✭✭✭

    I see. Hope it will work. If not, I think the problem might relate to either the DrmaaJobManager or DrmaaJobRunner classes, and if that is the case, I think finding the solution might be quite complex.

  • jturner Posts: 10 Member

    Hi Johan,

    Thanks for the advice. In that case, if this doesn't work I'll likely try some other parallelization method, such as just using one machine's worth of processors instead of GridEngine. The run is going ok so far, though!

    Thanks, Jason

  • jturner Posts: 10 Member

    Hi again,

    Huh. It ran through fine that time? While nondeterministic code behavior makes me nervous, I guess I'll take it... Thanks again for your help.

    Regards, Jason
