gatk "queue" - just getting started, trying to get "hello world" example working with Grid Engine.

caseybea Posts: 6 Member

Good morning team!

First, I should qualify my question by saying that I'm a Unix sysadmin trying to get the "queue" functionality set up on our cluster so our analysts can play. I'm hoping my question is simple. Here goes:

We have SGE, and I have downloaded the binary "queue" package.

My first attempt at executing the "hello world" example came up with this error:

    kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
    INFO 11:04:28,560 QScriptManager - Compiling 1 QScript
    INFO 11:04:31,265 QScriptManager - Compilation complete
    INFO 11:04:31,340 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:04:31,340 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
    INFO 11:04:31,340 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 11:04:31,340 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 11:04:31,341 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
    INFO 11:04:31,341 HelpFormatter - Date/Time: 2013/06/05 11:04:31
    INFO 11:04:31,341 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:04:31,341 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:04:31,346 QCommandLine - Scripting HelloWorld
    INFO 11:04:31,363 QCommandLine - Added 1 functions
    INFO 11:04:31,364 QGraph - Generating graph.
    INFO 11:04:31,373 QGraph - Running jobs.
    ERROR 11:04:31,427 QGraph - Uncaught error running jobs.
    java.lang.UnsatisfiedLinkError: Unable to load library 'drmaa': libdrmaa.so: cannot open shared object file: No such file or directory

Oops! It seems the drmaa library can't be found by default. I fixed that by adding the directory where the library lives, /gridware/sge/lib/lx-amd64, to the library search path on the node.
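
For anyone hitting the same UnsatisfiedLinkError, one way to add that directory to the search path for the current shell is through LD_LIBRARY_PATH. This is only a sketch: the post does not say which mechanism was actually used (an ldconfig entry would also work), and the helper name prepend_lib_path is made up for illustration.

```shell
# Directory holding libdrmaa.so on this cluster (taken from the post above).
SGE_LIB_DIR=/gridware/sge/lib/lx-amd64

# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
# (Hypothetical helper, not part of Queue or Grid Engine.)
prepend_lib_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

prepend_lib_path "$SGE_LIB_DIR"
echo "$LD_LIBRARY_PATH"
```

Run this in the same shell session before starting Queue, so that the java process inherits the variable.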

Success! Sort of. The error above is resolved, but I am now getting the error below, and this is where I'm stuck. It looks like either the job is never actually submitted, or it is submitted and then dies. I would really appreciate any insight the team can offer. We are very excited to get this environment working. Thank you in advance!

    kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
    INFO 11:07:52,728 QScriptManager - Compiling 1 QScript
    INFO 11:07:55,208 QScriptManager - Compilation complete
    INFO 11:07:55,271 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:07:55,271 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
    INFO 11:07:55,271 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 11:07:55,271 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 11:07:55,272 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
    INFO 11:07:55,272 HelpFormatter - Date/Time: 2013/06/05 11:07:55
    INFO 11:07:55,272 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:07:55,272 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:07:55,276 QCommandLine - Scripting HelloWorld
    INFO 11:07:55,292 QCommandLine - Added 1 functions
    INFO 11:07:55,292 QGraph - Generating graph.
    INFO 11:07:55,298 QGraph - Running jobs.
    INFO 11:07:55,481 FunctionEdge - Starting: echo hello world
    INFO 11:07:55,482 FunctionEdge - Output written to /shared/users/kcb/HelloWorld-1.out
    ERROR 11:07:55,507 Retry - Caught error during attempt 1 of 4.
    org.ggf.drmaa.InternalException: Error reading answer list from qmaster
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:400)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.runJob(JnaSession.java:79)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply$mcV$sp(DrmaaJobRunner.scala:87)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.util.Retry$.attempt(Retry.scala:49)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.start(DrmaaJobRunner.scala:84)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
    ERROR 11:07:55,510 Retry - Retrying in 1.0 minute.

Answers

  • caseybea Posts: 6 Member

    I have to add: Running the job without the gridengine jobrunner WORKS, so it doesn't look like an issue with the required basics.

  • Geraldine_VdAuwera Posts: 6,643 Administrator, GATK Developer

    Hi @caseybea,

    Welcome to GATK! We'll do what we can to help you set up the playroom for your users :)

    Although the first thing I'm going to do is punt on your question, because we don't use SGE ourselves, and the job runner is mostly the result of external contributions iirc. We have a few users here who do have much more experience with it than us, particularly @Johan_Dahlberg who has submitted patches to the drmaa job runner. Hopefully he (or others) might have a minute to jump in and perhaps shed some light on the behavior you're seeing.

    Geraldine Van der Auwera, PhD

  • caseybea Posts: 6 Member

    Hm. I may have jumped the gun. Before I even introduce the jobrunner stuff, I thought QUEUE was working to completion. Not so sure now?

    This is what I get when running the hello-world example, no queue runner:

    kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -run
    INFO  12:42:27,267 QScriptManager - Compiling 1 QScript 
    INFO  12:42:29,707 QScriptManager - Compilation complete 
    INFO  12:42:29,776 HelpFormatter - ---------------------------------------------------------------------- 
    INFO  12:42:29,776 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04 
    INFO  12:42:29,777 HelpFormatter - Copyright (c) 2012 The Broad Institute 
    INFO  12:42:29,777 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  12:42:29,777 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -run 
    INFO  12:42:29,777 HelpFormatter - Date/Time: 2013/06/05 12:42:29 
    INFO  12:42:29,777 HelpFormatter - ---------------------------------------------------------------------- 
    INFO  12:42:29,777 HelpFormatter - ---------------------------------------------------------------------- 
    INFO  12:42:29,782 QCommandLine - Scripting HelloWorld 
    INFO  12:42:29,798 QCommandLine - Added 1 functions 
    INFO  12:42:29,799 QGraph - Generating graph. 
    INFO  12:42:29,805 QGraph - Running jobs. 
    INFO  12:42:29,812 QGraph - 0 Pend, 0 Run, 0 Fail, 1 Done 
    INFO  12:42:29,815 QCommandLine - Writing final jobs report... 
    INFO  12:42:29,815 QJobsReporter - Writing JobLogging GATKReport to file /shared/users/kcb/HelloWorld.jobreport.txt 
    INFO  12:42:29,824 QJobsReporter - Plotting JobLogging GATKReport to file /shared/users/kcb/HelloWorld.jobreport.pdf 
    WARN  12:42:29,827 RScriptExecutor - Skipping: Rscript (resource)org/broadinstitute/sting/queue/util/queueJobReport.R /shared/users/kcb/HelloWorld.jobreport.txt /shared/users/kcb/HelloWorld.jobreport.pdf 
    INFO  12:42:29,828 QCommandLine - Script completed successfully with 1 total jobs 
    

    The only output I see is the HelloWorld.jobreport.txt file, and all that's in it is the following. I don't see any actual output:

    #:GATKReport.v1.1:0
    
  • pdexheimer Posts: 372 Member, GSA Collaborator ✭✭✭
    edited June 2013

    Run it with -startFromScratch. You've run it successfully once, and Queue noted that (with an empty file called .SOMETHING.done). When you reran, it saw that earlier success and didn't bother running the job (notice that immediately after "Running jobs" it claimed success).
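
    To see which jobs Queue already considers finished, you can list those hidden marker files. A small sketch, assuming the markers sit in the directory where the job output was written and match the glob .*.done (the helper name list_done_markers is hypothetical):

    ```shell
    # List Queue's hidden ".<something>.done" markers in a directory
    # (defaults to the current directory). Assumption: the markers live
    # next to the job output files.
    list_done_markers() {
        find "${1:-.}" -maxdepth 1 -type f -name '.*.done'
    }

    list_done_markers .
    ```

    If the list is not empty, adding -startFromScratch to the same Queue command line makes Queue rerun those jobs instead of skipping them.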

  • caseybea Posts: 6 Member

    Ah! OK, that fixed my intermediary issue, I can once again verify that this works without the jobrunner (thank you!!).

    I'm now back to my original issue, hoping someone can shed light. I also verified that the various "qsub" examples shown on the GATK/Queue debugging web page all work fine.

  • caseybea Posts: 6 Member
    edited June 2013

    Hi everyone! I really appreciate the couple of tips added above, but I am sadly still trying to figure out why the job(s) don't actually execute in SGE. If anyone who is familiar with this can assist, that would be awesome. I promise to follow up with personal notes and observations about how it all works. @Johan_Dahlberg - might you be able to take a moment to view the error? Our entire sequencing core here is totally excited about the possibility of getting GATK to operate across multiple nodes!

  • caseybea Posts: 6 Member

    I will post the error below here formatted in a cleaner way for easy reading. In reviewing my original post, it's a mess (sorry about that!)

    kcb@lima:~> java -Xmx2048M -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -qsub -jobQueue fast.q -run
    INFO  13:36:23,737 QScriptManager - Compiling 1 QScript 
    INFO  13:36:26,320 QScriptManager - Compilation complete 
    INFO  13:36:26,399 HelpFormatter - ---------------------------------------------------------------------- 
    INFO  13:36:26,399 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04 
    INFO  13:36:26,400 HelpFormatter - Copyright (c) 2012 The Broad Institute 
    INFO  13:36:26,400 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  13:36:26,400 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -qsub -jobQueue fast.q -run 
    INFO  13:36:26,400 HelpFormatter - Date/Time: 2013/06/12 13:36:26 
    INFO  13:36:26,400 HelpFormatter - ---------------------------------------------------------------------- 
    INFO  13:36:26,400 HelpFormatter - ---------------------------------------------------------------------- 
    INFO  13:36:26,405 QCommandLine - Scripting HelloWorld 
    INFO  13:36:26,423 QCommandLine - Added 1 functions 
    INFO  13:36:26,438 QGraph - Generating graph. 
    INFO  13:36:26,450 QGraph - Running jobs. 
    INFO  13:36:26,735 FunctionEdge - Starting: echo hello world 
    INFO  13:36:26,736 FunctionEdge - Output written to /shared/users/kcb/HelloWorld-1.out 
    ERROR 13:36:26,783 Retry - Caught error during attempt 1 of 4. 
    org.ggf.drmaa.InternalException: Error reading answer list from qmaster
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:400)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.runJob(JnaSession.java:79)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply$mcV$sp(DrmaaJobRunner.scala:87)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.util.Retry$.attempt(Retry.scala:49)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.start(DrmaaJobRunner.scala:84)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
    ERROR 13:36:26,787 Retry - Retrying in 1.0 minute. 
    
  • Johan_Dahlberg Posts: 85 Member ✭✭✭

    Sorry for the very late answer. I've been too busy with other stuff to drop by here. Looking at the error above, I'd say that "org.ggf.drmaa.InternalException: Error reading answer list from qmaster" is the key to solving this problem. However, I did some quick looking around for something to shed some light on this, and the only things I found were 2 C source files (and since I don't read C, it didn't help me very much).

    Here are the links if anyone can make sense of it:

    https://github.com/gridengine/gridengine/blob/master/source/libs/japi/msg_japi.h

    http://arc.liv.ac.uk/repos/hg/sge/source/libs/japi/japi.c

    Searching for "MSG_JAPI_BAD_GDI_ANSWER_LIST" in the first link should take you where you want to go. But as I said, since my C is almost nonexistent, I can't really figure out in any detail what the method does.

    My ideas on how to move forward with this would be to:

    1) Add a Thread.sleep(120*1000) to the HelloWorld script to get the script to stay on the node for 2 minutes (or as long as you need), and see if it pops up in the list of running jobs. Since I don't use GE myself I can't provide a command to do this, but I guess that there is some equivalent of squeue in SLURM. If it doesn't show up in the list, then at least you can conclude that it's not being sent to a node by GE.

    2) Have a look at the code here, to see if all the arguments that you would normally need to set when running jobs manually are being set correctly:

    https://github.com/broadgsa/gatk-protected/blob/2a7af4316478348f7ea58e0803b3391593d6dbd6/public/scala/src/org/broadinstitute/sting/queue/engine/gridengine/GridEngineJobRunner.scala

    This was my problem when first starting to use Queue. Since our cluster enforces that a time argument has to be sent with the job, and Queue didn't give one, my jobs were not sent to the queue.

    3) If you don't get that to work, try running Queue with -jobRunner Drmaa -jobNative <whatever args you need> and see if that works better. I run on SLURM using only the default Drmaa jobRunner and that works great.
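
    On Grid Engine the counterpart of squeue is qstat. A rough sketch of the check in idea 1, assuming qstat is on the PATH of the host you submit from; the helper name user_has_jobs is only for illustration:

    ```shell
    # Succeeds if qstat reports at least one pending or running job
    # for the current user (qstat prints nothing when there are none).
    user_has_jobs() {
        [ -n "$(qstat -u "${USER:-$(id -un)}" 2>/dev/null)" ]
    }

    # Only run the check where qstat is actually installed.
    if command -v qstat >/dev/null 2>&1; then
        if user_has_jobs; then
            echo "Grid Engine lists at least one job for this user"
        else
            echo "no jobs listed: the submission probably never reached qmaster"
        fi
    fi
    ```

    Run it from a second terminal while the test job should still be alive, otherwise a short job like "echo hello world" may finish before you get a chance to look.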

    As I said, sorry for the late answer. Please let me know if there's anything more I can help with.

  • Johan_Dahlberg Posts: 85 Member ✭✭✭

    Also: Adding -l DEBUG to the command line might turn up some more information on what's going on.

  • louws San Francisco Posts: 15 Member

    I know it's a year old, but I also came across libdrmaa issues. Aside from the excellent advice above, I would recommend the obvious check to see if you are trying to submit a Queue job from within a node on your cluster. Some clusters restrict submitting jobs that use the libdrmaa scheduler from within a node (even if you make the path to the lib explicit with -Djava.library.path=[path to libdrmaa]). The solution is to try submitting the job from the head node.
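
    One way to run that check before launching Queue is to ask Grid Engine for its list of submit hosts with qconf -ss. A hedged sketch: the helper name is_submit_host is made up, and it assumes the host names in that list match what hostname or hostname -f print on your machine.

    ```shell
    # Succeeds if this machine appears in Grid Engine's submit-host list.
    # Tries both the short and the fully qualified host name, ignoring case.
    is_submit_host() {
        hosts=$(qconf -ss 2>/dev/null) || return 1
        for name in "$(hostname 2>/dev/null)" "$(hostname -f 2>/dev/null)"; do
            [ -n "$name" ] || continue
            if printf '%s\n' "$hosts" | grep -F -x -i -q -- "$name"; then
                return 0
            fi
        done
        return 1
    }

    if is_submit_host; then
        echo "this host is allowed to submit jobs"
    else
        echo "not a submit host (or qconf unavailable): try the head node"
    fi
    ```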
