
Queue with Grid Engine

Geraldine_VdAuwera (Posts: 6,274; Administrator, GATK Developer)
edited February 3 in Queue

1. Background

Thanks to contributions from the community, Queue contains a job runner compatible with Grid Engine 6.2u5.

As of July 2011, there are several known forked distributions of Sun's Grid Engine 6.2u5. As long as they are JDRMAA 1.0 source-compatible with Grid Engine 6.2u5, the compiled Queue code should run against each of these distributions. However, we have yet to receive confirmation that Queue works on any of these setups.

Our internal QScript integration tests run the same tests on both LSF 7.0.6 and a Grid Engine 6.2u5 cluster set up on older software released by Sun.

If you run into trouble, please let us know. If you would like to contribute additions or bug fixes, please create a fork of our GitHub repo so that we can review and pull in the patch.

2. Running Queue with GridEngine

Try out the Hello World example with -jobRunner GridEngine.

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run

If all goes well, Queue should dispatch the job to Grid Engine and wait until the status returns RunningStatus.DONE, and "hello world" should be echoed into the output file, possibly along with other Grid Engine log messages.

See QFunction and Command Line Options for more info on Queue options.

3. Debugging issues with Queue and GridEngine

If you run into an error with Queue submitting jobs to GridEngine, first try submitting the HelloWorld example with -memLimit 2:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run -memLimit 2

Then try the following GridEngine qsub commands. They are based on what Queue submits via the API when running the HelloWorld.scala example with and without memory reservations and limits:

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=2048M -l h_rss=2458M echo hello world
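
Once a test job has finished, you can confirm that it actually produced output. The helper below is a minimal sketch: `check_hello` is a hypothetical name, and it assumes the job has already left the queue and that `test.out` (the `-o` target used above) is in the current directory.

```shell
#!/bin/sh
# Return 0 if the job's output file contains the expected greeting.
# The test jobs use "-j y", so stderr is merged into the same file and
# other Grid Engine messages may appear alongside the greeting.
check_hello() {
    out_file=${1:-test.out}
    if [ ! -f "$out_file" ]; then
        echo "missing: $out_file (job may still be pending or running)" >&2
        return 2
    fi
    if grep -q 'hello world' "$out_file"; then
        echo "ok: found 'hello world' in $out_file"
    else
        echo "fail: 'hello world' not found in $out_file" >&2
        return 1
    fi
}
```

Run `check_hello test.out` after the job is done; a non-zero exit status means the file is missing or does not contain the greeting.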

Also check whether your cluster enforces a memory limit. For example, try submitting jobs that request increasing amounts of memory, up to 16G:

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=4096M -l h_rss=4915M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=8192M -l h_rss=9830M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=16384M -l h_rss=19960M echo hello world
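
The memory probes above can also be generated by a small script. This is only a sketch: `probe_cmd` is a hypothetical helper, and it assumes that `h_rss` is roughly 1.2 times `mem_free`, which is what the 2048M, 4096M and 8192M pairs above suggest (for 16384M it yields 19661M rather than the 19960M shown above; either value serves the purpose of the test). It prints the commands instead of running them.

```shell
#!/bin/sh
# Print (but do not run) a qsub test command for a mem_free value in MB.
# Assumption: h_rss is about 1.2 x mem_free, rounded to the nearest MB.
probe_cmd() {
    mem_free_mb=$1
    h_rss_mb=$(( (mem_free_mb * 12 + 5) / 10 ))
    printf 'qsub -w e -V -b y -N echo_hello_world -o test.out -wd "$PWD" -j y -l mem_free=%sM -l h_rss=%sM echo hello world\n' \
        "$mem_free_mb" "$h_rss_mb"
}

# Dry run: list the probes from 2G to 16G, doubling each time.
for mb in 2048 4096 8192 16384; do
    probe_cmd "$mb"
done
```

Review the printed commands first, then pipe the output to `sh` to submit them. The first request that Grid Engine rejects indicates where the cluster's limit lies.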

If the above tests pass and GridEngine still does not dispatch jobs submitted by Queue, please report the issue to our support forum.


Geraldine Van der Auwera, PhD

Comments

  • delagoya (Posts: 1; Member)

    You should use h_vmem instead of or along with mem_free for the qsub submission examples above. mem_free only checks memory usage at the time of first entering running status, which is OK for short-lived processes, but not for long-lived ones, where memory usage can grow over time.

    E.g. qsub -l h_vmem=16G,mem_free=16G ...

  • yfarjoun (Broad Institute; Posts: 15; GATK Developer)
    edited May 2013

    You mean to use public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala, right?

  • redzengenoist (Posts: 27; Member)
    edited February 3

    Hello there,

    I've got an issue running scatter-gather on Grid Engine 6.2u5 on Red Hat.

    When I first ran it, it reported libdrmaa.so missing, so I did a cluster-wide search and found the admins' version of libdrmaa.so. That meant I could finally run basic hello world scripts, such as the one below:

    $ java -Djava.io.tmpdir=$temp \
           -jar $queu -jobRunner GridEngine \
           -S $home/QUEUETools/newest/resources/ExampleUnifiedGenotyper.scala \
           -R $home/QUEUETools/newest/resources/exampleFASTA.fasta \
           -I $home/QUEUETools/newest/resources/exampleBAM.bam -run

    INFO  18:31:07,505 QScriptManager - Compiling 1 QScript
    INFO  18:31:13,574 QScriptManager - Compilation complete
    INFO  18:31:13,697 HelpFormatter - ----------------------------------------------------------------------
    INFO  18:31:13,697 HelpFormatter - Queue v2.7-2-g6bda569, Compiled 2013/08/28 16:33:34
    INFO  18:31:13,697 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO  18:31:13,697 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO  18:31:13,698 HelpFormatter - Program Args: -jobRunner GridEngine -S /xxx/QUEUETools/newest/resources/ExampleUnifiedGenotyper.scala -R /xxx/QUEUETools/newest/resources/exampleFASTA.fasta -I /xxx/QUEUETools/newest/resources/exampleBAM.bam -run
    INFO  18:31:13,698 HelpFormatter - Date/Time: 2014/02/03 18:31:13
    INFO  18:31:13,698 HelpFormatter - ----------------------------------------------------------------------
    INFO  18:31:13,699 HelpFormatter - ----------------------------------------------------------------------
    INFO  18:31:13,708 QCommandLine - Scripting ExampleUnifiedGenotyper
    INFO  18:31:13,844 QCommandLine - Added 2 functions
    INFO  18:31:13,844 QGraph - Generating graph.
    INFO  18:31:13,872 QGraph - Generating scatter gather jobs.
    INFO  18:31:13,903 QGraph - Removing original jobs.
    INFO  18:31:13,907 QGraph - Adding scatter gather jobs.
    INFO  18:31:14,688 QGraph - Regenerating graph.
    INFO  18:31:14,706 QGraph - Running jobs.
    INFO  18:31:15,322 QGraph - 0 Pend, 0 Run, 0 Fail, 7 Done
    INFO  18:31:16,379 QCommandLine - Writing final jobs report...
    INFO  18:31:16,380 QJobsReporter - Writing JobLogging GATKReport to file /xxx/QUEUETools/Queue_2.7.2/resources/ExampleUnifiedGenotyper.jobreport.txt
    INFO  18:31:16,635 QJobsReporter - Plotting JobLogging GATKReport to file /xxx/QUEUETools/Queue_2.7.2/resources/ExampleUnifiedGenotyper.jobreport.pdf
    WARN  18:31:16,648 RScriptExecutor - Skipping: Rscript (resource)org/broadinstitute/sting/queue/util/queueJobReport.R /xxx/QUEUETools/Queue_2.7.2/resources/ExampleUnifiedGenotyper.jobreport.txt /xxx/QUEUETools/Queue_2.7.2/resources/ExampleUnifiedGenotyper.jobreport.pdf
    INFO  18:31:16,655 QCommandLine - Script completed successfully with 7 total jobs
    

    So, that's fine.

    However, when I try to run basically the same script on actual BAM files, I get this error:

    $ java -Djava.io.tmpdir=$temp \
           -jar $queu -jobRunner GridEngine \
           -S $home/QUEUETools/newest/resources/ExampleUnifiedGenotyper.scala \
           -R $dxfa \
           -I $gatr/bamlists/currentrecalbams.test2.list -run
    
    
    blabla
    
    INFO  18:36:22,307 QGraph - Generating scatter gather jobs.
    INFO  18:36:22,338 QGraph - Removing original jobs.
    INFO  18:36:22,341 QGraph - Adding scatter gather jobs.
    INFO  18:36:23,164 QGraph - Regenerating graph.
    INFO  18:36:23,200 QGraph - Running jobs.
    INFO  18:36:27,499 FunctionEdge - Starting: LocusScatterFunction: List(/share/XFS0016/gata/bamlists/currentrecalbams.test2.list, /ifshk7/ST_PG/PMO/SZY11098/indx/GATKh19bundle/ucsc.hg19.fasta) > List(/ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/temp_1_of_3/scatter.intervals, /ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/temp_2_of_3/scatter.intervals, /ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/temp_3_of_3/scatter.intervals)
    INFO  18:36:27,499 FunctionEdge - Output written to /ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/scatter/scatter.out
    INFO  18:36:28,067 QGraph - 6 Pend, 1 Run, 0 Fail, 0 Done
    INFO  18:36:58,383 FunctionEdge - Done: LocusScatterFunction: List(/share/XFS0016/gata/bamlists/currentrecalbams.test2.list, /ifshk7/ST_PG/PMO/SZY11098/indx/GATKh19bundle/ucsc.hg19.fasta) > List(/ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/temp_1_of_3/scatter.intervals, /ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/temp_2_of_3/scatter.intervals, /ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/temp_3_of_3/scatter.intervals)
    INFO  18:36:58,387 QGraph - Writing incremental jobs reports...
    INFO  18:36:58,388 QJobsReporter - Writing JobLogging GATKReport to file /ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/ExampleUnifiedGenotyper.jobreport.txt
    INFO  18:36:58,610 FunctionEdge - Starting:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/share/XFS0016/temp'  '-cp' '/xxx/QUEUETools/newest/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'UnifiedGenotyper'  '-I' '/share/XFS0016/gata/bamlists/currentrecalbams.test2.list'  '-L' '/ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/temp_1_of_3/scatter.intervals'  '-R' '/ifshk7/ST_PG/PMO/SZY11098/indx/GATKh19bundle/ucsc.hg19.fasta'  '-o' '/ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/temp_1_of_3/currentrecalbams.test2.listunfiltered.vcf'
    INFO  18:36:58,611 FunctionEdge - Output written to /ifshk5/PC_HUMAN_AP/PMO/SZY11098_HUMbjjR/QUEUETools/Queue_2.7.2/resources/.queue/scatterGather/ExampleUnifiedGenotyper-1-sg/temp_1_of_3/currentrecalbams.test2.listunfiltered.vcf.out
    
    **ERROR** 18:36:58,890 Retry - Caught error during attempt 1 of 4.
    org.broadinstitute.sting.queue.QException: Unable to submit job: error: no suitable queues
    
    blablabla
    

    I know which value the queue field needs: the default qsub test command gives the same error:

    qsub -w e -V -b y -N echo_hello_world -l vf=4G -o test.out -wd $PWD -j y echo hello world
    Unable to run job: error: no suitable queues.
    Exiting.
    

    Which I can thusly correct:

    qsub -w e -V -b y -N echo_hello_world -l vf=5G -q st.q -P st_pg vf=4G -o test.out -cwd -j y echo hello world
    Your job 990540 ("echo_hello_world") has been submitted
    

    My question is: how do I change the default parameters of DRMAA / Queue so that it uses my desired -q parameter? I can't edit .so files, it seems.

  • Geraldine_VdAuwera (Posts: 6,274; Administrator, GATK Developer)

    Hi there,

    We don't work with DRMAA so I can't help you, but perhaps one of our resident superusers such as @pdexheimer or @Johan_Dahlberg will be able to jump in with an answer.

    Geraldine Van der Auwera, PhD

  • pdexheimer (Posts: 356; Member, GSA Collaborator)

    There's a global -jobQueue argument (i.e., java -jar Queue.jar -S script.scala -jobQueue st.q …), but it looks like the DRMAA runner never uses it. Unfortunately, I don't know anything about DRMAA either, so I don't know exactly how to make the fix.

  • thibault (Posts: 19; GATK Developer)

    As a workaround you can try Queue's --jobNative argument (or the equivalent QFunction property .jobNativeArgs) to pass arguments directly to DRMAA.

    Joel Thibault ~ Software Engineer ~ GSA ~ Broad Institute

  • Johan_Dahlberg (Posts: 85; Member)

    Yes, I can second @thibault's solution. However, whether it works depends on the DRMAA implementation, since implementations seem to handle the jobNative arguments quite differently.

  • redzengenoist (Posts: 27; Member)

    That sounds very promising, actually.

    I've narrowed it down: I will not actually need the qsub -q argument; all I need is a -P argument (qsub -P st_pg).

    However, I'm not sure of the syntax for the native argument. When I write it like this:

    java -Djava.io.tmpdir=$temp -jar $queu -jobRunner GridEngine -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam -jobNative -P st_pg -memLimit 2 -run

    I get this:

    INFO  14:20:11,924 QScriptManager - Compiling 1 QScript
    INFO  14:20:17,726 QScriptManager - Compilation complete
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR stack trace
    org.broadinstitute.sting.commandline.InvalidArgumentException:
    Argument with name 'P' isn't defined.
            at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:303)
            at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:276)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:213)
            at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
            at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
            at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-2-g6bda569):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Argument with name 'P' isn't defined.
    ##### ERROR ------------------------------------------------------------------------------------------
    

    Can anybody guess what the argument format is supposed to be?

  • redzengenoist (Posts: 27; Member)
    edited February 4

    Ah - I continued to play with it, and I just had to format the argument as a string:

    -jobNative "-P st_pg -l vf=6G etc etc etc"

    Thanks to @thibault and @Johan_Dahlberg, you guys are brilliant. I've got queue working and the pertinent jobs submitted.
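
The quoting matters because of how the shell splits the command line before any program sees it. The snippet below is a generic shell illustration, not Queue code: `count_args` is a hypothetical helper that only reports how many separate arguments it received.

```shell
#!/bin/sh
# Print how many separate arguments the shell passes on.
count_args() {
    echo "$#"
}

# Unquoted: "-P" and "st_pg" arrive as separate arguments, so an argument
# parser sees "-P" as an option of its own.
count_args -jobNative -P st_pg        # prints 3

# Quoted: the whole native specification arrives as a single value.
count_args -jobNative "-P st_pg"      # prints 2
```

In the unquoted form the parser receives a standalone `-P`, which is consistent with the "Argument with name 'P' isn't defined" error reported earlier in the thread; in the quoted form the string is handed over as one value for -jobNative.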

  • DavidRies (Posts: 11; Member)

    Hi, Queue in general works fine for me on Grid Engine. There is a little performance tweak I would like to suggest. At the moment, GridEngineJobRunner.scala forces "the remote environment to inherit local environment settings". That might be a good idea in general, to make sure the jobs get all they need, but with hundreds of clustered jobs it unnecessarily slows down the system. I'm not much of a Scala programmer (yet), so I don't see a way to turn the -V flag off other than changing the source code manually and recompiling the whole thing. It would be nice to be able to set the inheritance to false. Maybe @pdexheimer or @Johan_Dahlberg know a solution?

  • pdexheimer (Posts: 356; Member, GSA Collaborator)

    @DavidRies - As you suggest, the -V parameter is always set for GridEngine jobs. You're right, at the moment you'd have to remove it in the code and recompile Queue to get rid of it.

    The solution would be to add another argument to QSettings, then conditionally add -V to nativeSpec depending on the contents of that argument. However, adding a runner-specific argument to the global QSettings wouldn't be great; it should really be something applicable to any runner in general. I'm not certain exactly what -V does (beyond what's in the comment, of course), so I'm not sure whether it's an easily generalizable concept.
