# gatk "queue" - just getting started, trying to get "hello world" example working with Grid Engine.

Posts: 6 (Member)

Good morning team!

First, I should say up front that I'm a Unix sysadmin trying to get the "queue" functionality implemented in our cluster so our analysts can play. I'm hoping my question is simple; here goes:

My first attempt at executing the "hello world" example came up with this error:

    kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
    INFO 11:04:28,560 QScriptManager - Compiling 1 QScript
    INFO 11:04:31,265 QScriptManager - Compilation complete
    INFO 11:04:31,340 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:04:31,340 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
    INFO 11:04:31,340 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 11:04:31,340 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 11:04:31,341 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
    INFO 11:04:31,341 HelpFormatter - Date/Time: 2013/06/05 11:04:31
    INFO 11:04:31,341 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:04:31,341 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:04:31,346 QCommandLine - Scripting HelloWorld
    INFO 11:04:31,363 QCommandLine - Added 1 functions
    INFO 11:04:31,364 QGraph - Generating graph.
    INFO 11:04:31,373 QGraph - Running jobs.
    ERROR 11:04:31,427 QGraph - Uncaught error running jobs.
    java.lang.UnsatisfiedLinkError: Unable to load library 'drmaa': libdrmaa.so: cannot open shared object file: No such file or directory

Oops! It seems Queue can't find the drmaa library by default. So, I fixed that by adding the following directory to the library search path on the node: /gridware/sge/lib/lx-amd64 (which is where that library lives).
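For anyone following along, here is a minimal sketch of one way to do that. It assumes the search path is extended through `LD_LIBRARY_PATH` in the shell that launches Queue; the post does not say which mechanism was actually used (a system-wide `ldconfig` entry would be another option). The directory is the one quoted above and will differ on other installs.

```shell
# Directory taken from the post above; adjust for your own SGE install.
SGE_LIB_DIR=/gridware/sge/lib/lx-amd64

# Prepend the directory to LD_LIBRARY_PATH unless it is already listed.
case ":${LD_LIBRARY_PATH:-}:" in
  *":${SGE_LIB_DIR}:"*) ;;
  *) LD_LIBRARY_PATH="${SGE_LIB_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" ;;
esac
export LD_LIBRARY_PATH

# Report whether the library is actually where we expect it.
if [ -e "${SGE_LIB_DIR}/libdrmaa.so" ]; then
  echo "libdrmaa.so found in ${SGE_LIB_DIR}"
else
  echo "libdrmaa.so not found in ${SGE_LIB_DIR}"
fi
```

The `case` guard keeps the path from growing if the snippet is sourced more than once, for example from a login profile.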

Success! Sort of. The error above is resolved, but I am now getting the error below, and this is where I'm stuck. It doesn't look like the job is actually getting submitted, OR, it's getting submitted and dies. I would really appreciate any insight the team can offer, we are very excited to try to get this environment to work, thank you in advance!

    kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
    INFO 11:07:52,728 QScriptManager - Compiling 1 QScript
    INFO 11:07:55,208 QScriptManager - Compilation complete
    INFO 11:07:55,271 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:07:55,271 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
    INFO 11:07:55,271 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 11:07:55,271 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 11:07:55,272 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
    INFO 11:07:55,272 HelpFormatter - Date/Time: 2013/06/05 11:07:55
    INFO 11:07:55,272 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:07:55,272 HelpFormatter - ----------------------------------------------------------------------
    INFO 11:07:55,276 QCommandLine - Scripting HelloWorld
    INFO 11:07:55,292 QCommandLine - Added 1 functions
    INFO 11:07:55,292 QGraph - Generating graph.
    INFO 11:07:55,298 QGraph - Running jobs.
    INFO 11:07:55,481 FunctionEdge - Starting: echo hello world
    INFO 11:07:55,482 FunctionEdge - Output written to /shared/users/kcb/HelloWorld-1.out
    ERROR 11:07:55,507 Retry - Caught error during attempt 1 of 4.
    org.ggf.drmaa.InternalException: Error reading answer list from qmaster
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:400)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.runJob(JnaSession.java:79)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply$mcV$sp(DrmaaJobRunner.scala:87)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.util.Retry$.attempt(Retry.scala:49)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.start(DrmaaJobRunner.scala:84)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
    ERROR 11:07:55,510 Retry - Retrying in 1.0 minute.

## Answers

• Posts: 6 (Member)

I have to add: Running the job without the gridengine jobrunner WORKS, so it doesn't look like an issue with the required basics.

• Posts: 6,453 (Administrator, GATK Developer)

Hi @caseybea, Welcome to GATK!
We'll do what we can to help you set up the playroom for your users :) Although the first thing I'm going to do is punt on your question, because we don't use SGE ourselves, and the job runner is mostly the result of external contributions iirc. We have a few users here who do have much more experience with it than us, particularly @Johan_Dahlberg who has submitted patches to the drmaa job runner. Hopefully he (or others) might have a minute to jump in and perhaps shed some light on the behavior you're seeing.

Geraldine Van der Auwera, PhD

• Posts: 6 (Member)

Hm. I may have jumped the gun. Before I even introduce the jobrunner stuff, I thought QUEUE was working to completion. Not so sure now? This is what I get when running the hello-world example, no queue runner:

    kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -run
    INFO 12:42:27,267 QScriptManager - Compiling 1 QScript
    INFO 12:42:29,707 QScriptManager - Compilation complete
    INFO 12:42:29,776 HelpFormatter - ----------------------------------------------------------------------
    INFO 12:42:29,776 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
    INFO 12:42:29,777 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 12:42:29,777 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 12:42:29,777 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -run
    INFO 12:42:29,777 HelpFormatter - Date/Time: 2013/06/05 12:42:29
    INFO 12:42:29,777 HelpFormatter - ----------------------------------------------------------------------
    INFO 12:42:29,777 HelpFormatter - ----------------------------------------------------------------------
    INFO 12:42:29,782 QCommandLine - Scripting HelloWorld
    INFO 12:42:29,798 QCommandLine - Added 1 functions
    INFO 12:42:29,799 QGraph - Generating graph.
    INFO 12:42:29,805 QGraph - Running jobs.
    INFO 12:42:29,812 QGraph - 0 Pend, 0 Run, 0 Fail, 1 Done
    INFO 12:42:29,815 QCommandLine - Writing final jobs report...
    INFO 12:42:29,815 QJobsReporter - Writing JobLogging GATKReport to file /shared/users/kcb/HelloWorld.jobreport.txt
    INFO 12:42:29,824 QJobsReporter - Plotting JobLogging GATKReport to file /shared/users/kcb/HelloWorld.jobreport.pdf
    WARN 12:42:29,827 RScriptExecutor - Skipping: Rscript (resource)org/broadinstitute/sting/queue/util/queueJobReport.R /shared/users/kcb/HelloWorld.jobreport.txt /shared/users/kcb/HelloWorld.jobreport.pdf
    INFO 12:42:29,828 QCommandLine - Script completed successfully with 1 total jobs

The only output I see is the HelloWorld.jobreport.txt file, and all that's in it is the following. I don't actually see output?:

    #:GATKReport.v1.1:0

• Posts: 362 (Member, GSA Collaborator), edited June 2013

Run it with -startFromScratch. You've run it successfully once, and Queue noted that (with an empty file called .SOMETHING.done). When you reran, it saw that earlier success and didn't bother running the job (notice that immediately after "Running jobs" it claimed success).

Post edited by pdexheimer

• Posts: 6 (Member)

Ah! OK, that fixed my intermediary issue, I can once again verify that this works without the jobrunner (thank you!!). I'm now back to my original issue, hoping someone can shed light. I also did verify that the variety of "qsub" examples shown on the gatk/queue debugging web page all work fine.

• Posts: 6 (Member), edited June 2013

Hi everyone! I really appreciate the couple of tips added above, but I am sadly still trying to figure out why the job(s) don't actually execute in SGE. If anyone who is familiar with this can assist, that would be awesome. I promise to follow up with personal notes and observations about how it all works. @Johan_Dahlberg - might you be able to take a moment to view the error? Our entire sequencing core here is totally excited about the possibility of getting GATK to operate across multiple nodes!
Post edited by caseybea

• Posts: 6 (Member)

I will post the error below here formatted in a cleaner way for easy reading. In reviewing my original post, it's a mess (sorry about that!)

    kcb@lima:~> java -Xmx2048M -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -qsub -jobQueue fast.q -run
    INFO 13:36:23,737 QScriptManager - Compiling 1 QScript
    INFO 13:36:26,320 QScriptManager - Compilation complete
    INFO 13:36:26,399 HelpFormatter - ----------------------------------------------------------------------
    INFO 13:36:26,399 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
    INFO 13:36:26,400 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 13:36:26,400 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 13:36:26,400 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -qsub -jobQueue fast.q -run
    INFO 13:36:26,400 HelpFormatter - Date/Time: 2013/06/12 13:36:26
    INFO 13:36:26,400 HelpFormatter - ----------------------------------------------------------------------
    INFO 13:36:26,400 HelpFormatter - ----------------------------------------------------------------------
    INFO 13:36:26,405 QCommandLine - Scripting HelloWorld
    INFO 13:36:26,423 QCommandLine - Added 1 functions
    INFO 13:36:26,438 QGraph - Generating graph.
    INFO 13:36:26,450 QGraph - Running jobs.
    INFO 13:36:26,735 FunctionEdge - Starting: echo hello world
    INFO 13:36:26,736 FunctionEdge - Output written to /shared/users/kcb/HelloWorld-1.out
    ERROR 13:36:26,783 Retry - Caught error during attempt 1 of 4.
    org.ggf.drmaa.InternalException: Error reading answer list from qmaster
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:400)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.runJob(JnaSession.java:79)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply$mcV$sp(DrmaaJobRunner.scala:87)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.util.Retry$.attempt(Retry.scala:49)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.start(DrmaaJobRunner.scala:84)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
    ERROR 13:36:26,787 Retry - Retrying in 1.0 minute.

• Posts: 85 (Member)

Sorry for the very late answer. I've been too busy with other stuff to drop by here. Looking at the error above I'd say that "org.ggf.drmaa.InternalException: Error reading answer list from qmaster" is the key to solving this problem. However, I did some quick looking around for something to shed some light on this, and the only things I found were 2 C source files (and since I don't read C it didn't help me very much).

Here are the links if anyone can make sense of it:

https://github.com/gridengine/gridengine/blob/master/source/libs/japi/msg_japi.h

http://arc.liv.ac.uk/repos/hg/sge/source/libs/japi/japi.c

Searching for "MSG_JAPI_BAD_GDI_ANSWER_LIST" in the first link should take you where you want to go. But as I said, since my C is almost nonexistent I can't really figure out in any detail what the method does.

My ideas on how to move forward with this would be to:

1) Add a Thread.sleep(120*1000) to the HelloWorld script to get the script to stay on the node for 2 minutes (or as long as you need), and see if it pops up in the "running jobs" list. Since I don't use GE myself I can't provide a command to do this, but I guess that there is some equivalent of squeue in SLURM. If it doesn't show up in the list then at least you can conclude that it's not being sent to a node by GE.

2) Check whether all the arguments that you would normally need to set when running jobs manually are being set correctly. This was my problem when first starting to use Queue: since our cluster enforces that a time argument has to be sent with the job, and Queue didn't give one, my jobs were not sent to the queue.

3) If you don't get that to work try running Queue with -jobRunner Drmaa -jobNative <what ever args you need> and see if that works better. I run on SLURM using only the default Drmaa jobRunner and that works great.
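On the first idea: the Grid Engine counterpart of SLURM's squeue is `qstat`. Below is a rough sketch of how one might check whether the Queue job ever reaches the scheduler. It assumes the Grid Engine client tools are on `PATH`; the helper names `list_my_jobs` and `count_jobs` are made up for this example.

```shell
# Print the qstat listing for one user, or a note on stderr if qstat is missing.
# Assumption: Grid Engine's client tools are on PATH on this host.
list_my_jobs() {
  if command -v qstat >/dev/null 2>&1; then
    qstat -u "$1"
  else
    echo "qstat not found on PATH" >&2
  fi
}

# Count job rows in qstat-style output read from stdin.
# qstat prints two header lines (column names and a dashed rule) before the jobs,
# and prints nothing at all when the user has no jobs.
count_jobs() {
  awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Example: how many jobs does the current user have right now?
echo "jobs visible to qstat: $(list_my_jobs "${USER:-$(id -un)}" | count_jobs)"
```

Running this in a second terminal while Queue sits in its one-minute retry wait should show whether anything was submitted at all. A count of zero would point at the submission step rather than at the job itself.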

As I said, sorry for the late answer. Please let me know if there's anything more I can help with.

• Posts: 85 (Member)

Also: Adding -l DEBUG to the command line might turn up some more information on what's going on.
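As a sketch, the debug run could be put together like this. The jar and script paths are copied from the commands earlier in the thread, `-startFromScratch` is added so that Queue does not skip the job as already done, and the log file name in the comment is only a placeholder. The helper `queue_debug_cmd` is a made-up name that just prints the command line rather than running it.

```shell
# Paths copied from the commands earlier in the thread; adjust for your site.
QUEUE_JAR=/apps/Queue-2.5-2-gf57256b/Queue.jar
QSCRIPT=/apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala

# Print the Queue command line with debug logging switched on.
queue_debug_cmd() {
  echo "java -jar ${QUEUE_JAR} -S ${QSCRIPT} -jobRunner GridEngine -run -startFromScratch -l DEBUG"
}

# Show the command that would be run.
queue_debug_cmd

# To actually run it and keep a copy of the log (placeholder file name):
#   $(queue_debug_cmd) 2>&1 | tee queue-debug.log
```

Keeping the full log in a file makes it easier to post the relevant part back here.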

• San Francisco, Posts: 15 (Member)

I know it's a year old, but I also came across libdrmaa issues. Aside from the excellent advice above, I would recommend the obvious check to see if you are trying to submit a Queue job from within a node on your cluster. Some clusters restrict submitting jobs that use the libdrmaa scheduler from within a node (even if you make the path to the lib explicit with -Djava.library.path=[path to libdrmaa]). The solution is to try submitting the job from the headnode.
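On Grid Engine, a quick way to test this might look like the sketch below. It assumes `qconf` is on `PATH` and relies on `qconf -ss`, which lists the hosts allowed to submit jobs, one per line. The helper name `host_in_list` is made up for this example.

```shell
# Succeed if the host name given as $1 appears as a whole line on stdin.
host_in_list() {
  grep -Fqx -- "$1"
}

# Assumption: qconf is on PATH here; 'qconf -ss' prints one submit host per line.
if command -v qconf >/dev/null 2>&1; then
  if qconf -ss | host_in_list "$(hostname)"; then
    echo "$(hostname) is listed as a submit host"
  else
    echo "$(hostname) is not listed as a submit host"
  fi
else
  echo "qconf not found; cannot check submit hosts from here"
fi
```

The match is exact, so a host registered under its fully qualified name will not match a short `hostname` output; compare both forms by hand before drawing conclusions.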