Cromwell and WDL's off-Broadway debut
Well, no, there isn't a Cromwell/WDL musical (yet) -- it's just that I gave a talk last night in New York, at a meetup organized by Phosphorus and hosted by FirstMark. We had a great crowd, and I have to give them mad props for listening to me go on about GATK workflows and pipelining strategies for well over the allotted hour. Especially considering I've been getting over a bout of laryngitis and my voice kept oscillating between high-pitched whine and raspy whisper... There will be a video posted on YouTube in the next few days and it would be awesome if they could get someone to do a voice dub! In the meantime, my slide deck is available here.
UPDATE: And here's the video on YouTube.
My talk attempted to cover the key challenges we encounter when trying to scale up our genomic pipelines. First I went over some numbers from the Broad's Genomics Platform sequencing production, to show the kind of scale we're dealing with. "Big Data" is a term that gets thrown around a lot these days, so I wanted to make clear that we're not kidding when we say our data is of the capital-B Big variety. Having set that stage, I reviewed the key bottlenecks in the GATK Best Practices variant discovery workflows (germline and somatic short variants, and somatic copy number) to understand what strategies we can apply at the tool and workflow level to scale past them. And finally, I presented the infrastructure solutions we have been developing in the Data Sciences Platform to enable our own Ops group to run GATK Best Practices at scale in the Broad's production pipelines, as well as enable researchers everywhere to run pipelines and share methods and data on the cloud.
Cromwell and WDL are key components of this infrastructure, so they naturally played a starring role in this portion of my presentation! It's hard to summarize all the features this pairing offers in just a few slides -- and I realize I forgot to talk about call caching (aka job avoidance), a recently added Cromwell feature that lets us skip re-running tasks that have already been run on the same data as part of other submissions -- but hey, that's what the documentation and this blog are for.
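If you want to try call caching yourself, here's a minimal sketch of how it can be switched on in Cromwell's configuration file (the HOCON file you pass to Cromwell at startup). Option names and defaults can shift between Cromwell releases, so treat this as illustrative and check the documentation for your version:

```hocon
# Enable call caching so Cromwell can reuse the outputs of calls
# it has already run on the same inputs, instead of re-running them.
call-caching {
  enabled = true
  # If a cached result turns out to be unusable (for example, its
  # output files were deleted), invalidate that cache entry and
  # re-run the call rather than failing the workflow.
  invalidate-bad-cache-results = true
}
```

With this in place, matching calls across submissions are detected automatically -- no changes to your WDL are needed.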
Anyway, the crowd had lots of good questions and comments, both during the talk and afterward. From the various conversations I had, it was obvious that pretty much everyone in the field is hitting these problems, whether they're researchers trying to run pipelines or engineers trying to set up infrastructure for research groups. Admittedly this is not a novel observation -- but it's gratifying to confirm that we're working on the right problems, and potentially the right kind of solutions, not just for our own use cases at Broad but also for the wider community.