
How can I make GATK tools run faster?

It sure seems like everyone has a need for speed these days. There are two main ways to get your analysis results faster:

  • Parallelism, which doesn't actually make the calculations faster, but makes the wait shorter from your point of view (a.k.a. "wall-clock time") by running things in parallel. For a primer on the concept of parallelism and a breakdown of available options for parallelizing GATK (multithreading with Spark and scatter-gathering with Cromwell), see this article.

  • Better hardware, which does make the calculations go faster, to varying degrees depending on the tool and the hardware in question. One could fill a book with discussions about how to use different types of hardware to the best effect for data science (and someone probably has), so let's just focus on the big picture. When we talk about achieving faster speeds through hardware upgrades, we mean three types of things: "generically" better hardware, "normal" hardware for which software optimizations are available, and specialized "alphabet soup" processors like GPUs, FPGAs, and TPUs. Read this article to learn more.
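To make the multithreading option above concrete, here is a sketch of how a Spark-capable GATK4 tool can be run with local multithreading. The file paths and thread count are placeholders, and Spark tools like HaplotypeCallerSpark are in beta, so validate results against the non-Spark tool before relying on them:

```shell
# Run HaplotypeCallerSpark on a single machine, using 8 local threads.
# Arguments after the lone "--" are passed to Spark itself.
gatk HaplotypeCallerSpark \
    -R reference.fasta \
    -I input.bam \
    -O output.vcf.gz \
    -- \
    --spark-runner LOCAL \
    --spark-master 'local[8]'
```

The same `--spark-runner` / `--spark-master` pattern applies to the other *Spark tools; pointing `--spark-master` at a cluster instead of `local[N]` is how the same command scales beyond one machine.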

Technically there's also a third option (which I think of as the turd option, personally): cut corners by skipping steps and/or compromising on quality. But that's a topic for another time, another doc...

Alright, but how should I set this up in practice?

Due to the extreme variety of infrastructure and use cases out in the world, we don't give specific guidelines for the type and configuration of hardware you should use to run GATK, because that's outside the scope of what we can reasonably provide with the resources we have.

We do, however, share the WDL workflows that we use in production to run the GATK Best Practices pipelines. These scripts feature the parallelization strategies that we chose to implement for each pipeline, and the accompanying example input JSON files include the parameter settings for hardware resources that we use on the Google Cloud Platform. You can even run these workflows yourself the same way we do, through FireCloud. FireCloud is a secure, freely accessible cloud-based analysis portal developed at the Broad Institute. It includes preconfigured GATK Best Practices pipelines as well as tools for building your own custom pipelines (with any command line tool you want, not just GATK).
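The hardware parameters mentioned above live in the `runtime` section of each WDL task. As an illustrative sketch (the exact values and Docker tag here are placeholders, not our production settings), a task's resource requests on Google Cloud look like this:

```wdl
runtime {
  docker: "broadinstitute/gatk:4.1.0.0"  # container image to run the task in
  memory: "8 GB"                          # RAM requested for the VM
  cpu: 4                                  # number of cores
  disks: "local-disk 100 HDD"             # scratch disk size and type
  preemptible: 3                          # retry on cheaper preemptible VMs up to 3 times
}
```

Cromwell reads these attributes when it dispatches the task, so tuning a pipeline for your own infrastructure is often just a matter of editing these values (or the corresponding entries in the input JSON) rather than changing the workflow logic.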

Alternatively, our collaborators at the Intel-Broad Center for Genomic Data Engineering have done a ton of benchmarking and can provide you with recommended hardware configurations for local infrastructure based on your planned usage. Let us know in the comment thread if you'd like us to introduce you.


  • mikedamour Member

    Hi BITeam,
    I have read through and like the idea of the free cloud portal (Thanks!). Right now I don't have a project to charge GCP time, so need to use my Mac (i7/4core, plenty of RAM/SSD) for some .org cancer work. I see the HaplotypeCallerSpark beta - nice work! Any multithread work on Mutect2 for MacOSX?
    Glad to be beta on that. Any timeline?
    Best, Mike D'Amour

  • Sheila Broad Institute Member, Broadie ✭✭✭✭✭

    Hi Mike,

    There are talks of "Sparkifying" Mutect2, but it has not happened yet. Have a look at this issue ticket for more information. Perhaps if you post there, it may resurrect the discussion.


  • Oops, just noticed your reply. Thanks, Sheila. Have gotten a billing account and like using FireCloud and GCP better than waiting on my machine.
    Best, Mike D'
