Conference proceedings: Sheila's experience at AGBT 2017
As part of our job providing support to the GATK user community, our team takes turns traveling to conferences, both to learn what's going on in the field at large and to advertise the latest features of the GATK. I recently attended the Advances in Genome Biology and Technology (AGBT) general meeting in Hollywood, Florida in February. Nice time of year to go there!
When we go to conferences we often do workshops or present posters, but this time was a first: I was there to do a software demo. Well, in fact I had two demos prepared: one about using GATK4 to run commands directly on a Spark cluster, and the other about running GATK workflows on the Cloud using Google's Pipelines API.
These two demos involved a lot of material that was new to me, because my day-to-day work is focused on supporting "regular" use of the current version of GATK. I had a lot to learn about GATK4 and the Cloud before I could handle doing a demo on them! It turns out GATK4 is not so different from the current GATK, at least for the end user. But the Cloud part involves a lot of computer jargon, and lots of technical infrastructure details that I'm not used to having to think about, like VM configurations and access permissions. It can seem a bit overwhelming when you start, and I think a lot of people in the GATK user community probably feel that way too. But I found out that once you get past the jargon, it's not really that complicated. And in the demos, we try to show every step in practice, so you can easily replicate it yourself.
In the GATK4 Spark demo, we show how to set up a Spark cluster on Google Dataproc, and run a Spark-enabled version of BaseRecalibrator on it directly from your laptop terminal. In the demo it runs in 8 minutes on an exome. You can make it go faster (pretty much as fast as you want) by giving the cluster more nodes, or more powerful nodes, when you first set it up. The caveat is that you will end up paying Google more for the computation. It's up to you to decide the tradeoff you want to make between speed and cost. You can find a video of the demo on our YouTube demo channel.
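To make the speed versus cost tradeoff a bit more concrete, here is a toy sketch in Python. It is not how we or Google actually estimate anything: it assumes a simple Amdahl's-law model in which a fixed fraction of the job cannot be parallelized, and every number in it (the single-node runtime, the serial fraction, the hourly price per node) is a made-up placeholder rather than a real GATK4 or Dataproc figure.

```python
# Toy model of the speed/cost tradeoff when adding nodes to a cluster.
# Every number here is a hypothetical placeholder, not a GATK or Google figure.

def runtime_minutes(nodes, single_node_minutes=60.0, serial_fraction=0.1):
    """Amdahl's-law estimate: only the parallel part shrinks as nodes are added."""
    parallel_fraction = 1.0 - serial_fraction
    return single_node_minutes * (serial_fraction + parallel_fraction / nodes)


def cost_dollars(nodes, price_per_node_hour=0.40, **kwargs):
    """Every node is billed for the whole duration of the run."""
    hours = runtime_minutes(nodes, **kwargs) / 60.0
    return nodes * hours * price_per_node_hour


if __name__ == "__main__":
    # Print estimated runtime and cost for a few cluster sizes.
    for n in (1, 2, 4, 8, 16):
        print(f"{n:>2} nodes: {runtime_minutes(n):5.1f} min, ${cost_dollars(n):.2f}")
```

Under these assumptions, each time you add nodes the run gets shorter but the total bill creeps up, because the part of the job that does not parallelize is paid for on every node. Real clusters will behave differently, so treat this as a way to think about the tradeoff, not as a prediction.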
In the other demo, we show how to run a GATK workflow written in WDL on Google Cloud. The workflow we run in this demo is the first part of the germline Best Practices pipeline, which goes from unmapped BAMs to a HaplotypeCaller GVCF for a single sample. The demo is based on a Google tutorial that you can find here. It makes it really simple to run complete pipelines on the Cloud!
At AGBT there were only four booths for software demos, including ours, and as it happened our booth was right next to the bar. Great, I thought: after people grab a drink, they will come right over to my booth! And we certainly got a lot of traffic. Lots of people came over with specific questions they wanted to ask, even before seeing the demos. I found that there were two frequently asked questions. Can I run GATK4 Spark commands on a platform other than Google Cloud? Short answer: yes, but you have to roll your own submission script. Have you done a cost/benefit analysis of using Spark to accelerate processing? Short answer: we're doing that this quarter and we expect to have that information in time for the GATK4 general release. That is tentatively planned for June 2017.
So how did the demos go? Well, I was mostly worried I would get questions about technical things I'm not so familiar with, like YAML and JSON files and what Spark is exactly. But what I actually found most challenging was dealing with people coming and going all the time during the demo. I thought I would be doing each of the demos for a group of people, then those people would leave and I would do the demos for a new group of people. Rinse and repeat. However, with my chatty nature and different people stopping by at different times, I found it hard to deal with everyone all at once. Personally I prefer having a one-on-one conversation with one person at a time. But this was still a very useful way to spread the word about GATK4's new features and release timeline.
And I got to talk with one of our forum super-users, Johan Dahlberg! Johan came all the way from Sweden to see my demo! And, possibly also to hear about the other things going on at AGBT, but... details. Johan knows a huge amount about GATK because he set up GATK and Queue pipelining for his institute. Sometimes we actually call on him to help us answer questions about Queue, because our own team no longer uses it and therefore can't help as much. So I was thrilled to meet him in person. Getting to talk to people in person is a lot more fun than interacting with people somewhat anonymously on a forum. In fact, I wish I had met more forum users there... If any of you were at AGBT but did not come say hi, shame on you!
Finally, I want to give a shout-out to the person I saw wearing a shirt that said “Must love GATK”. I was so in awe that I did not get a chance to properly introduce myself and get a picture! If you happen to be that person and are reading this, please say something!