Queue: possible to set max number of a certain kind of job?

Let's say I have a case class in a qscript like so:

case class getFileFromRemote(remoteFile: File, localFile: File) extends ExternalCommonArgs with SixteenCoreJob with FourDayJob {
    @Input(doc = "remote") val _remoteFile = remoteFile
    @Output(doc = "local") val _localFile = localFile
    def commandLine = "fetchFile.sh " + remoteFile + " " + localFile
    this.isIntermediate = true
    this.jobName = "fetchFile"
    this.analysisName = this.jobName
}

Then I can add it to my script() like so:

add( getFileFromRemote("[email protected]:/path/to/remote", "path/to/local") )

All is well so far.

Let's say that I have lots of files to fetch, so I add the fetch jobs in a for loop over a Seq of file names, and then add the downstream jobs as usual. The problem I run into is that all 1000+ fetchFile.sh sessions (which use irods/irsync behind the scenes) start at the same time, choking the system so that nothing gets downloaded.
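
The fetch loop looks roughly like this (remoteFiles, localDir and the downstream processFile job are simplified placeholders, not the real script):

for (remoteFile <- remoteFiles) {
    val localFile = new File(localDir, remoteFile.getName)
    add( getFileFromRemote(remoteFile, localFile) )
    add( processFile(localFile) ) // downstream job that takes the fetched file as @Input
}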

One solution would be to be able to set a limit in my fetcher case class, telling Queue never to submit more than 5 (or so) of these jobs to the cluster at a time. Is that possible, or can anyone see another way around this?

Answers

  • The "correct" way to handle this is outside of Queue, in the job scheduler. On my cluster, LSF will happily enqueue 1000 jobs, and only run them as cores become available. If I wanted to further restrict the number of concurrent jobs, I'd make a new LSF queue that only ran so many jobs.

    To hack Queue to do this, you'd probably have to add code into the loop that moves jobs from pending to executing, so that only n jobs of that class could execute at once. But that's not the intended behavior, because Queue assumes it's feeding jobs into a scheduler of some sort.
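
    If you do go the scheduler route, the only change on the Queue side should be pointing just the fetch jobs at that restricted queue. A minimal sketch, assuming your base traits ultimately extend CommandLineFunction (which exposes a jobQueue field) and that your scheduler has a queue/partition named "fetch" capped at ~5 concurrent jobs:

    // in getFileFromRemote, next to isIntermediate/jobName:
    this.jobQueue = "fetch" // hypothetical capped queue/partition; the scheduler enforces the concurrency limit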

  • @pdexheimer Thanks, and sorry for being unclear.

    Our backend job scheduler is SLURM, which works very well with DRMAA. 1000+ jobs in SLURM is of course no problem. It's not SLURM that's being choked, it's the system fetching the files. Even using rsync, I suspect that 1000+ parallel connections might not work very well, if at all.

    The issue is that my pipeline fetches files on demand, and when I start the pipeline there are ≈1500 files to be fetched, each around 5 GB. No, I don't run it from the start that often with all samples...

    What I would like to do is cap only the getFileFromRemote jobs at some maximum number submitted to the queue at any one time.

    Maybe something like

    this.maxNumberOfJobsOfThisClass = 5
    

    in the case class def is what I'm after.

    All jobs of other kinds (alignment, adapter trimming, base recal, realn, etc.) should be allowed to be submitted to the cluster as many times as Queue pleases. As it is now, I have to fetch all the files first, wait for every single one to finish, and only then kick off the pipeline.

    I've been thinking about having one job that fetches all the files, which the rest of the pipeline would depend on, but that would mean that every other job has to wait for that one master fetch job, which just seems... inelegant in such a nice pipelining system. ;)
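
    To be concrete, I mean something along these lines (fetchAllFiles.sh and the done-flag file are made up for the sketch, not something I've actually written):

    case class fetchAllFiles(fileList: File, doneFlag: File) extends ExternalCommonArgs with SixteenCoreJob with FourDayJob {
        @Input(doc = "list of remote files to fetch") val _fileList = fileList
        @Output(doc = "flag touched when every file is local") val _doneFlag = doneFlag
        // fetchAllFiles.sh would work through the list a few irsync sessions at a time
        def commandLine = "fetchAllFiles.sh " + fileList + " && touch " + doneFlag
        this.jobName = "fetchAllFiles"
        this.analysisName = this.jobName
    }

    Every downstream job would then take doneFlag as an extra @Input, so nothing starts before the whole fetch has finished.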

    I understand that such an option is not available, and might be outside of the scope of Queue. Any ideas on workarounds are welcome.

  • Thanks for the suggestions @pdexheimer, I'll try 'em both I think.
