How can I make GATK tools run faster?
It sure seems like everyone has a need for speed these days. So, there are two main ways to get your analysis results faster:
Parallelism, which doesn't actually make the calculations faster, but makes the wait shorter from your point of view (a.k.a. "wall-clock time") by running things in parallel. For a primer on the concept of parallelism and a breakdown of available options for parallelizing GATK (multithreading with Spark and scatter-gathering with Cromwell), see this article.
Better hardware, which does make the calculations go faster, to varying degrees depending on the tool and the hardware in question. One could fill a book with discussions about how to use different types of hardware to the best effect for data science (and someone probably has), so let's just focus on the big picture. When we talk about achieving faster speeds through hardware upgrades, we mean three types of things: "generically" better hardware, "normal" hardware for which there are software optimizations available, and specialized "alphabet soup" processors like GPUs, FPGAs, and TPUs. Read this article to learn more.
Technically there's also a third option (which I think of as the turd option, personally): cut corners by skipping steps and/or compromising on quality. But that's a topic for another time, another doc...
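To make the wall-clock point about parallelism concrete, here is a minimal Python sketch of the scatter-gather idea. It is not GATK, Spark, or Cromwell code: `process_interval`, `run_serial`, `run_scatter_gather`, and `timed` are hypothetical names, and the "work" is just a short sleep standing in for running a tool on one genomic interval.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_interval(interval, seconds=0.1):
    """Hypothetical stand-in for running a tool on one interval."""
    time.sleep(seconds)  # simulated work; a real tool would compute here
    return f"{interval}:done"


def run_serial(intervals):
    """Process the intervals one after another."""
    return [process_interval(i) for i in intervals]


def run_scatter_gather(intervals, workers=4):
    """Scatter the intervals across workers, then gather results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(process_interval, intervals))


def timed(fn, *args):
    """Return (result, elapsed wall-clock seconds) for fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    intervals = ["chr1", "chr2", "chr3", "chr4"]
    serial, t_serial = timed(run_serial, intervals)
    parallel, t_parallel = timed(run_scatter_gather, intervals)
    print(f"same results: {serial == parallel}")
    print(f"serial wall-clock:   {t_serial:.2f}s")
    print(f"parallel wall-clock: {t_parallel:.2f}s")
```

Both runs do the same four units of simulated work and return the same results; only the elapsed time you wait shrinks, which is exactly the distinction between faster calculations and a shorter wall-clock time.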
Alright, but how should I set this up in practice?
Because infrastructure and use cases vary so widely, we don't give specific guidelines for the type and configuration of hardware you should use to run GATK; that's outside the scope of what we can reasonably provide with the resources we have.
We do, however, share the WDL workflows that we use in production to run the GATK Best Practices pipelines. These scripts feature the parallelization strategies that we chose to implement for each pipeline, and the accompanying example input JSON files include the parameter settings for hardware resources that we use on the Google Cloud Platform. You can even run these workflows yourself, the same way we do, through FireCloud. FireCloud is a secure, freely accessible cloud-based analysis portal developed at the Broad Institute. It includes preconfigured GATK Best Practices pipelines as well as tools for building your own custom pipelines (with any command line tool you want, not just GATK).
Alternatively, our collaborators at the Intel-Broad Center for Genomic Data Engineering have done a ton of benchmarking and can provide you with recommended hardware configurations for local infrastructure based on your planned usage. Let us know in the comment thread if you'd like us to introduce you.