How Actively Used is Queue?

I have a feeling this question doesn't have a straight answer, but I'm curious: how actively used is Queue at the Broad as of 2015, and can anyone speak to whether the Broad views Queue as part of the long-term picture?

Based on what I read, Queue is really the only documented way to parallelize tasks, since a lot of tools don't have multi-threading options. On paper, Queue looks like it offers a pretty nice set of features. There doesn't appear to be a terribly active user base on this site around Queue, and most docs are a few years old. That is generally not a good sign; however, the fact that you need some non-trivial setup and a cluster to really take advantage of it probably accounts for at least some of that. I'm pretty interested in seeing whether we can implement it internally, but I am hoping to get some sense of how actively used the platform is first.

Thanks for any insight.

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi @bbimber,

    I'll try to give you the straightest answer I can, though there will be some fuzzy bits on the edges of the map.

    The methods development team uses Queue daily; it's a major part of our setup. We rarely ever run GATK "manually". There are also a few other projects that we know of that use Queue in a big way, like GenomeStrip.

    The production pipeline at Broad does not run on Queue; it has its own dedicated infrastructure. Most of the research community at Broad that uses GATK and related tools (such as MuTect) also doesn't use Queue; instead they use a self-service pipelining system called Firehose.

    Out in the wild, we know of a few super-users who use Queue for pipelining purposes, but it's not very widespread in the user community as far as we can tell. The learning curve is generally regarded as steep and our Queue docs are pretty unhelpful: sparse in general, and in some cases out of date. They are old mostly because very little has changed in a long time. That is because the tool is mature, not abandoned, but you're right that lack of movement is usually not encouraging. Mostly in this case it denotes the fact that providing active support for Queue has never been a priority, and to be brutally honest, it's unlikely that it will become a priority anytime soon. There is some chatter about providing better example scripts, which should help, but I don't foresee this happening particularly soon.

    We certainly plan to keep using and maintaining Queue for the foreseeable future, but in this field that isn't saying all that much.

    So, in a nutshell, yes, Queue is the recommended way of pipelining GATK short of building your own dedicated infrastructure. In terms of software maturity, it's pretty darn solid. But support and documentation are sorely lacking, so it's a "roll your own if you can dig it" kind of situation.

  • bbimber, Home (Member)

    Thanks. Can I assume these other internal solutions you mention are going to remain internal, and that we shouldn't expect to see anything else become a better alternative than Queue?

  • bbimber, Home (Member)

    Also, can you comment on why internal groups went with these other options? On paper, Queue's functionality looks like exactly what one needs. Are you willing to comment on any shortcomings Broad staff found? I assume the docs didn't put them off...

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    For the foreseeable future, yes. But for full disclosure, we are exploring options to make GATK more cloud-friendly, which may eventually lead to different solutions becoming better alternatives somewhere down the line. That's probably a year+ horizon though. If you want to get any work done in 2015 you're better off getting started with Queue :)

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Re: your followup question, why other internal groups at Broad went for different solutions... Beyond the usual organizational silo issues (which explain e.g. Picard / GATK being developed separately in the past), there are two main reasons. In the case of the production pipeline, the ops team found that Queue lacked production-level features like interactive pipeline management, change tracking, status messaging and so on. For the research community, it's pretty much the opposite -- Queue was seen as too difficult for a lot of the scientists who don't have the kind of scripting experience that Queue favors.

  • bbimber, Home (Member)

    Very helpful, thank you.

  • ymc (Member)

    I am using Queue. I think it is still useful even if you only have one multi-core workstation. It can speed up IndelRealignment and HaplotypeCaller significantly.

    The documentation is obviously inadequate, but it is still possible to figure it out by reading some scripts you can download from the internet. However, I can see that many scientists who are not from a programming background might find learning a new language, Scala, not worth the effort.

  • bbimber, Home (Member)

    Could you say a little more about the single-machine scenario? My read from other threads was that if you have a single machine, you have no parallelization of jobs. Do you have a custom jobRunner or something along these lines?

  • Johan_Dahlberg (Member ✭✭✭)
    edited March 2015

    I could pitch in here, since I guess we are one of those super-users. Here at the National Genomics Infrastructure at SciLifeLab in Sweden, we are using Queue to set up our production pipelines for human whole genome sequencing data, and we expect to be ramping up to processing about 10 000 genomes per year on it once everything is 100% operational.

    As for the single-machine scenario, we have solved it by writing a custom jobRunner (see a previous discussion on the topic here: http://gatkforums.broadinstitute.org/discussion/5198/creating-a-parallelshelljobrunner-for-queue#latest). You can find the code here at the moment: https://github.com/NationalGenomicsInfrastructure/piper/tree/master/src/main/scala/molmed/queue/engine/parallelshell. I might clean it up later and submit a pull request to the GATK team once I get the time, and see if they are interested.
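
    For readers who just want the general idea of single-machine parallelism, here is a minimal Python sketch. It is not the Piper code linked above and it does not use Queue's jobRunner interface; the function name `run_jobs_in_parallel` and the `max_workers` parameter are made up for illustration. It assumes the jobs are independent of each other, whereas a real job runner also has to respect dependencies between jobs.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_jobs_in_parallel(commands, max_workers=4):
    """Run independent commands concurrently; return exit codes in input order.

    Hypothetical helper for illustration only. Each command is an argument
    list, as accepted by subprocess.run.
    """
    def run_one(command):
        # Threads are sufficient here: the heavy work happens in child processes.
        return subprocess.run(command, capture_output=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input commands.
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Stand-in jobs; in practice each would be one per-interval tool invocation.
    jobs = [[sys.executable, "-c", f"print('shard {i}')"] for i in range(4)]
    print(run_jobs_in_parallel(jobs, max_workers=2))
```

    Bounding the pool with `max_workers` is the important design choice: it keeps the number of simultaneous processes in line with the cores and memory available on the workstation.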

    For scaling up with Queue, we've found so far that it strikes a nice balance between simplicity and robustness. However, I think that there are other initiatives out there that might give you approximately the same advantages. For the long term, I think that the Common Workflow Language looks interesting, as it would separate the workflow description from the actual runner implementation: https://github.com/common-workflow-language/common-workflow-language. Though I think it's rather early days there as of now.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    We'll take any pull request from you, @Johan_Dahlberg :)

    Seriously, the single-machine support sounds like a cool feature.

  • bbimber, Home (Member)

    Hi guys,

    That's a great tip on ParallelShellJobRunner; it will prove very useful.

    I have a more general question on Queue. I'm interested in a very basic use case. I want to run either HaplotypeCaller or IndelRealigner and use Queue to parallelize the task. All I really need is to take the arguments I would normally provide to the GATK tool and basically have a pass-through. This step is part of a larger pipeline. I'm considering:

    1) Have my code write out one-off Queue scripts that hard-code all the parameters/filepaths I need.

    2) I could create my own reusable Queue script that declares @Arguments and essentially wraps the standard tool. On the one hand this is more elegant, but it also feels oddly circular. My wrapper would basically just duplicate the arguments accepted by the primary tool. By any chance is there a better way to handle this scenario? Is there any sort of code to more automatically let me invoke individual GATK tools from Queue, passing their native arguments on my Queue command line?

  • pdexheimer (Member ✭✭✭✭)

    I like that idea of passing arguments untouched into the underlying tools. It's really interesting, and I could see how it would help this use case. It would be a non-trivial change, though, because the Queue CommandLineFunctions bypass the argument parser completely. I would argue that this is a good thing for two reasons:

    1) Any errors in specifying the arguments are caught when the qscript is compiled, rather than at some point during the pipeline execution.
    2) When you build a complex pipeline with Queue, you would have to decide which of the command line arguments go to which of the steps in the pipeline, which would mean you'd have to parse the command line in your qscript, which would be messy and duplicate a whole bunch of work.

    All things considered, I feel that your option 2 is the best choice. You'll probably only have a handful of arguments to those tools that you'll actually use.
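
    To make option 2 a bit more concrete, here is a rough Python sketch of a wrapper that exposes only a handful of explicit arguments and rejects bad input before anything runs. It is not a QScript (those are written in Scala and checked at compile time), so treat it purely as an illustration of the idea. The function name `build_haplotype_caller_command`, the wrapper's long option names, and the default jar path are hypothetical; `-T`, `-R`, `-I`, `-o` and `-L` are the usual GATK 3 command-line flags, but check them against your GATK version.

```python
import argparse


def build_haplotype_caller_command(argv, gatk_jar="GenomeAnalysisTK.jar"):
    """Parse a small, explicit argument set and build the tool command line.

    Hypothetical wrapper for illustration; argparse exits with an error if a
    required argument is missing or an unknown one is supplied.
    """
    parser = argparse.ArgumentParser(prog="hc_wrapper")
    parser.add_argument("--reference", required=True)
    parser.add_argument("--input-bam", required=True)
    parser.add_argument("--output-vcf", required=True)
    # Optional and repeatable: one -L flag is emitted per interval.
    parser.add_argument("--interval", action="append", default=[])
    args = parser.parse_args(argv)

    command = [
        "java", "-jar", gatk_jar, "-T", "HaplotypeCaller",
        "-R", args.reference, "-I", args.input_bam, "-o", args.output_vcf,
    ]
    for interval in args.interval:
        command += ["-L", interval]
    return command
```

    Only the arguments the pipeline actually uses are declared, so the duplication with the underlying tool stays small, and a misspelled option fails immediately instead of partway through a run.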

  • bbimber, Home (Member)

    Thanks for the reply. I agree that you don't want to pass through arguments in most cases. What I was thinking of was something analogous to what happens with GATKExtensionsGenerator, i.e. you have some code that iterates over the available walkers and more automatically makes a wrapper available. Same idea as #2, except without the redundant work. I have not really wrapped my head around how Scala actually works, but I may dig into this a little more.
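
    As a rough illustration of that "generate the wrappers automatically" idea, here is a small Python sketch that builds one wrapper per tool from a table of argument names. This is not how GATKExtensionsGenerator is implemented; `make_wrapper`, `TOOL_SPECS` and `WRAPPERS` are hypothetical names, and the table is hand-written here, whereas a real generator would discover the arguments by inspecting the tools themselves.

```python
def make_wrapper(tool_name, required, optional=()):
    """Return a function that builds the argument list for one tool."""
    allowed = set(required) | set(optional)

    def wrapper(**kwargs):
        # Validate up front so mistakes surface before any job is launched.
        missing = [name for name in required if name not in kwargs]
        unknown = sorted(set(kwargs) - allowed)
        if missing or unknown:
            raise ValueError(f"{tool_name}: missing={missing} unknown={unknown}")
        args = ["-T", tool_name]
        for name in list(required) + list(optional):
            if name in kwargs:
                args += [f"-{name}", str(kwargs[name])]
        return args

    return wrapper


# Hypothetical, hand-written tool table used only for this sketch.
TOOL_SPECS = {
    "HaplotypeCaller": {"required": ["R", "I", "o"], "optional": ["L"]},
    "IndelRealigner": {"required": ["R", "I", "targetIntervals", "o"], "optional": []},
}

# One wrapper per tool, generated by iterating over the table.
WRAPPERS = {name: make_wrapper(name, **spec) for name, spec in TOOL_SPECS.items()}
```

    Calling `WRAPPERS["HaplotypeCaller"](R="ref.fasta", I="in.bam", o="out.vcf")` returns the argument list for that tool, while a missing or unknown argument raises `ValueError` immediately.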
