Queue job submission rate slows over time

ryanabashbash (Oak Ridge National Laboratory) Member
edited April 2014 in Ask the GATK team

Hi all,

I'm stumped by some behavior regarding job submission by Queue (although I'm not sure if the problem is related to Queue).

I wrote a QScript and tested it on a single machine running SGE; everything worked entirely as expected. I've since moved the analysis over to a cluster using UGE. Everything works mostly as expected, except for two issues: one I posted about on the forums here, and one where the job submissions begin taking longer over time.

When the pipeline is first started, jobs are submitted rapidly (within a few seconds of each other). Over time, jobs take increasingly long (e.g. minutes) to submit, regardless of the job. The trend can be seen in the attached .pdf file. I can kill the pipeline and immediately restart it (either from scratch or resuming), and the behavior reproduces. Additionally, I can qsub the same jobs from a bash for loop without any problems.
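For example, a loop along these lines (filenames and qsub options are illustrative, not my exact commands) submits every job within seconds:

# Illustrative sketch only: submitting comparable cleanup jobs directly with qsub.
# Each submission returns almost immediately, even across many iterations.
for i in $(seq 1 24); do
    qsub -cwd -b y -N "cleanup_${i}" rm "${i}.sam" "${i}.sorted.bam" "${i}.sorted.bai"
done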

The terminal pauses for a minute (or more) between each submission after printing the "FunctionEdge - Output written to..." line, even for (what I think are) simple jobs:

INFO  13:07:14,093 FunctionEdge - Starting: rm 12.sam 12.sorted.bam 12.sorted.bai 12.sorted.intervals 12.sorted.realigned.bam 12.sorted.realigned.bai 
INFO  13:07:14,093 FunctionEdge - Output written to /data/ryanabashbash/src/Queue1104bDev/CleanupIntermediates.out 
INFO  13:08:08,810 DrmaaJobRunner - Submitted job id: 37461

On the single machine with SGE, these same jobs were submitted almost instantaneously, and I can qsub the same jobs from a bash for loop on the cluster just as fast. I'm guessing it's some sort of interaction between Queue and the architecture that I'm overlooking? Does anyone have any suggestions on how to get some traction in tracking down the cause, or has anyone seen similar behavior before?

As an aside, my .out files all contain something similar to the following errors at the beginning of the file (I haven't been able to track down the source, and I'm not sure if it is related; it doesn't seem to affect the output of the job):

sh: module: line 1: syntax error: unexpected end of file
sh: error importing function definition for `module'

Thanks for any suggestions or pointers!


Comments

  • pdexheimer Member ✭✭✭✭

    No, I haven't seen a slowdown like the one you describe, but my cluster uses LSF. It's the responsibility of the JobRunner to actually submit the jobs, so my experience with the LsfJobRunner is essentially meaningless regarding the DrmaaJobRunner. I believe there's a level of indirection here, though - Queue submits to DRMAA, which translates to SGE/UGE. I would expect the fault lies somewhere along that convoluted path, but I can't really point to a specific place to start debugging.

    Your module error looks familiar, though. Our system uses this "module" system for loading applications into the user environment. For instance, I have to run something like module load jdk or module load cufflinks near the top of my submission scripts. I don't know the name of the system, unfortunately. At any rate, this module system is actually present in the environment as functions in the shell. If I run set from my (bash) login shell, I see a module() function in the list somewhere. My guess is that you have this software on your cluster as well, and that it's not getting initialized properly on your compute nodes. This may be another Queue interaction - I vaguely remember having to manually add the initialization script to my job submission hooks some time ago. Our cluster has been upgraded since then, however, and I can't remember the exact circumstances.
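    If it helps, a rough way to check (a sketch only, assuming a bash shell; nothing here is specific to your cluster):

    # Is "module" defined as a shell function? Run this in a login shell,
    # then again from inside a submitted job, and compare the two.
    type module || echo "module is not defined in this shell"
    # Listing exported functions also works (this is what I see via "set"):
    set | grep -A 2 '^module ()'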

  • kevyin Member

    That error is a common bug with SGE 6.2u5
    http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/2012

    You can try adding this at the start of the bash script:
    . /etc/profile.d/modules.sh
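    For example, the top of the job script would look something like this (the module load line is just a placeholder for whatever the job actually needs):

    #!/bin/bash
    # Define the "module" shell function before any "module load" lines run.
    . /etc/profile.d/modules.sh
    module load jdk   # placeholder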

  • ryanabashbash (Oak Ridge National Laboratory) Member

    @pdexheimer and @kevyin

    Thank you both for the input!

    No luck so far with either the module error or the slowdown. I didn't have any success with kevyin's suggestion or with those in the posted link. As best I can tell, the shell problem regarding module seems to come from this code in DrmaaJobRunner.scala (I haven't tried changing the shell it uses and recompiling, though, so I can't confirm, and my inexperience means I could be 100% wrong):

    // Instead of running the function.commandLine, run "sh <jobScript>"
    drmaaJob.setRemoteCommand("sh")
    drmaaJob.setArgs(Collections.singletonList(jobScript.toString))
    

    Regarding the slowdown, I first tried checking the differences in environment variables between the fast submissions (doing the qsub from a bash for loop) and the slow submissions (Queue performing the submission). There are a number of differences in environment variables between the two, and while trying to make them match to rule out those differences, I found I could pass new environment variables, e.g. something like:

    java -jar Queue.jar -S foo.scala --job_native_arg "-v MADEUPVAR=placeholder"

    but I could never change an existing environment variable (e.g. TERM, TMPDIR) with this method. Regardless, thanks again for the suggestions.
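    For reference, the kind of comparison I was doing looked roughly like this (a sketch; the file names are arbitrary, and the Queue side needs a trivial job that just dumps its environment):

    # 1) Capture the environment seen by a job submitted directly with qsub
    echo 'env | sort > env_from_qsub.txt' | qsub -cwd -N env_qsub
    # 2) Capture the environment seen by a Queue-submitted job, e.g. a trivial
    #    CommandLineFunction whose command line is: env | sort > env_from_queue.txt
    # 3) Compare once both jobs finish
    diff env_from_qsub.txt env_from_queue.txt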

  • jeremylp2 Member

    Ryan, have you had any luck fixing this? I'm having the same problem (Queue 3.4) and haven't found a solution. Thanks!

  • ryanabashbash (Oak Ridge National Laboratory) Member
    edited July 2015

    Hi @jeremylp2

    No, unfortunately I was never able to track down the source. I ended up doing the dependency tracking and scatter/gathering we wanted from Queue with bash and qsub ( https://github.com/Frogee/RIG ). I'd be very interested if you find a solution or workaround.
