
Run the germline GATK Best Practices Pipeline for $5 per genome

ebanks · Broad Institute · Member, Broadie, Dev ✭✭✭✭
edited March 2018 in Announcements

By Eric Banks, Director, Data Sciences Platform at the Broad Institute

Last week I wrote about our efforts to develop a data processing pipeline specification that would eliminate batch effects, in collaboration with other major sequencing centers. Today I want to share our implementation of the resulting "Functional Equivalence" pipeline spec, and highlight the cost-centric optimizations we've made that make it incredibly cheap to run on Google Cloud.

For a little background, we started transitioning our analysis pipelines to Google Cloud Platform in 2016. Throughout that process we focused most of our engineering efforts on bringing down compute cost, which is the most important factor for our production operation. It's been a long road, but all that hard work really paid off: we managed to get the cost of our main Best Practices analysis pipeline down from about $45 to $5 per genome! As you can imagine that kind of cost reduction has a huge impact on our ability to do more great science per research dollar -- and now, we’re making this same pipeline available to everyone.

The Best Practices pipeline I'm talking about is the most common type of analysis done on a 30x WGS: germline short variant discovery (SNPs and indels). The pipeline takes the data from unmapped reads all the way to an analysis-ready BAM or CRAM (i.e. the part covered by the Functional Equivalence spec), then on to either a single-sample VCF or an intermediate GVCF, plus 15 steps of quality-control metrics collection at various points in the pipeline, all for $5 in compute cost on Google Cloud. As far as I know this is the most comprehensive pipeline available for whole-genome data processing and germline short variant discovery (without skimping on QC or important cleanup steps like base recalibration).

Let me give you a real-world example of what this means for an actual project. In February 2017, our production team processed a cohort of about 900 30x WGS samples through our Best Practices germline variant discovery pipeline; the compute costs totalled $12,150 or $13.50 per sample. If we had run the version of this pipeline we had just one year prior (before the main optimizations were made), it would have cost $45 per sample; a whopping $40,500 total! Meanwhile we've made further improvements since February, and if we were to run this same pipeline today, the cohort would cost only $4,500 to analyze.

                                2016     2017    Today
# of Whole Genomes Analyzed      900      900      900
Total Compute Cost           $40,500  $12,150   $4,500
Cost per Genome Analyzed         $45   $13.50       $5
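The totals follow directly from multiplying the per-genome cost by the cohort size; a quick sanity check in shell arithmetic (illustrative only, using integer cents for the 2017 figure):

```shell
# Sanity-check the table: total compute cost = cohort size x per-genome cost.
cohort=900
echo "2016:  \$$(( cohort * 45 ))"          # 900 x $45    = $40500
echo "2017:  \$$(( cohort * 1350 / 100 ))"  # 900 x $13.50 = $12150 (cents / 100)
echo "Today: \$$(( cohort * 5 ))"           # 900 x $5     = $4500
```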

For the curious, the most dramatic reductions we saw came from using different machine types for each of the various tasks (rather than piping data between tasks), leveraging GCP’s preemptible VMs, and most recently incorporating NIO to minimize the amount of data localization involved. You can read more about these approaches on Google's blog. At this point the single biggest culprit for cost in the pipeline is BWA (the genome mapper), a problem which its author Heng Li is actively working to address through a much faster (but equivalently accurate) mapper. Once Heng's new mapper is available, we anticipate the cost per genome analyzed to drop below $3.

On top of the low cost of operating the pipeline, the other huge bonus we get from running this pipeline on the cloud is that we can get any number of samples done in the time it takes to do just one, due to the staggeringly elastic scalability of the cloud environment. Even though it takes a single genome 30 hours to run through the pipeline (and we're still working on speeding that up), we're able to process genomes at a rate of one every 3.6 minutes, and we've been averaging about 500 genomes completed per day.

We're making the workflow script for this pipeline available on GitHub under an open-source license so anyone can use it, and we're also providing it as a preconfigured pipeline in FireCloud, the pipelining service we run on Google Cloud. Anyone can access FireCloud for free; you just need to pay Google for any compute and storage costs you incur when running the pipelines. So to be clear, when you run this pipeline on your data in FireCloud, all $5 of compute costs go directly to the cloud provider; we won't make any money off of it. And there are no licensing fees involved at any point!

As a cherry on the cake, our friends at Google Cloud Platform are sponsoring free credits to help first-time users get started with FireCloud: the first 1,000 applicants can get $250 worth of credits to cover compute and storage costs. You can learn more here on the FireCloud website if you're interested.

Of course, we understand that not everyone is on Google Cloud, so we are actively collaborating with other cloud vendors and technology partners to expand the range of options for taking advantage of our optimized pipelines. For example, the Chinese cloud giant Alibaba Cloud is developing a backend for Cromwell, the execution engine we use to run our pipelines. And it's not all cloud-centric either; we are also collaborating with our long-time partners at Intel to ensure our pipelines can be run optimally on on-premises infrastructure without compromising on quality.

In conclusion, this pipeline is the result of two years' worth of hard work by a lot of people, both on our team and on the teams of the institutions and companies we collaborate with. We're all really excited to finally share it with the world, and we hope it will make it easier for everyone in the community to get more mileage out of their research dollars, just like we do.

Post edited by Geraldine_VdAuwera


  • jaideepjoshi · Member
    edited April 2018

    Great blog, great info! I ran the pipeline in FireCloud on the NA12878 (small) sample successfully. If I want to run the same pipeline on my own in-house infrastructure, I assume I can export the WDL from FireCloud and modify it for my environment. However, is there a way to get the same sample input data, including the (small) BAM, that the pipeline uses when run in FireCloud? Thanks again.

  • ebanks · Broad Institute · Member, Broadie, Dev ✭✭✭✭

    Yes, @jaideepjoshi, you can export everything to your local environment. I'd highly recommend using Cromwell as your execution manager, since then you can use the WDL as-is without rewriting it from scratch. Cromwell (https://github.com/broadinstitute/cromwell) supports various local backends.
    The input data (both the example dataset and the resources used in the pipeline) are all available in public Google Cloud buckets, so you can just download them (ideally using Google's 'gsutil' tool).
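A minimal sketch of those two steps. The bucket, WDL, and inputs-JSON names below are placeholders, not the pipeline's real locations (those are listed in the FireCloud workspace); DRY_RUN=1 just prints the commands so you can review them before running for real:

```shell
# Hypothetical names: substitute the real bucket and file names from
# the FireCloud workspace before running.
DRY_RUN=${DRY_RUN:-1}
BUCKET="gs://example-public-bucket/five-dollar-genome"

run() {
  # Print the command in dry-run mode; otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1. Download the example dataset and resources from the public bucket
#    (public buckets need no credentials; -m parallelizes the copy).
run gsutil -m cp -r "$BUCKET/inputs" ./inputs

# 2. Run the exported WDL on Cromwell's default local backend.
run java -jar cromwell.jar run pipeline.wdl --inputs pipeline.inputs.json
```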
    Good luck!

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Also, the WDL scripts live in a GitHub repo, so you can subscribe to that to pull updates directly from there. I think we link to the GitHub location in the FC workspace; if not yet, we will start doing so.

    (Tagging @bshifaw who manages this content)

    Note also that the version in FC is optimized for Google so if you want to run locally some things may not work (this is a temporary limitation), but we have a universal version that can run locally out of the box. And Intel makes a version that is optimized for running on local infrastructure. If you tell us more about what infrastructure you’re using we may be able to give you more specific advice.

  • jaideepjoshi · Member
    edited April 2018

    Thanks ebanks/Geraldine_VdAuwera. I got the pipeline to work on a single CentOS server using the downloaded input files, Cromwell, and the Docker images specified in the WDL. There were quite a few modifications I had to make to the WDLs to get them to run, for example changing String to File for the (local) input files. Also, even though the pipeline runs locally, something in the WDL makes it necessary to do hash lookups of the Docker images, which prevents me from running behind a proxy.
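A minimal WDL sketch of the String-to-File change described above (the task and variable names are made up for illustration, not taken from the actual pipeline):

```wdl
task ValidateBam {
  # Before: declared as String, a local path is passed through verbatim
  # and never localized into the task's working directory.
  #   String input_bam
  # After: declared as File, Cromwell's local backend copies (localizes)
  # the file into the task's execution directory before the command runs.
  File input_bam

  command {
    echo "processing ${input_bam}"
  }
}
```

String inputs are convenient for gs:// URIs that tools stream directly; File is the right type once the data lives on local disk.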

    The next step is to run this on a Spark Cluster.

    The question is: can I run the pipeline using Cromwell and WDL on a Spark cluster? I do NOT want to run the Spark tools; I simply want to run the entire pipeline ("spark-submit cromwell-.jar run *.wdl --inputs *.json") as a Spark job. Is that possible? What would I have to change in the cromwell-.jar file to make it happen?

    Post edited by jaideepjoshi

    Issue filed on GitHub by Sheila
  • jaideepjoshi · Member
    edited April 2018

    I think I can't just run spark-submit cromwell-.jar ...

    Post edited by jaideepjoshi
  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭


    I am asking someone from the team for some help and will get back to you ASAP.


  • Sheila · Broad Institute · Member, Broadie ✭✭✭✭✭

    Hi again,

    From the developer:

    "You can currently run individual tasks within a WDL on a Spark cluster (by running the Spark-based GATK tools), but it's not possible to run an entire WDL on a Spark cluster unless cromwell were to implement a Spark-based backend."


  • @bshifaw, @ebanks: I have successfully run the five-dollar pipeline in FireCloud several times, and the documentation has enabled me to run the same on-prem as well. In trying to get an apples-to-apples comparison, is there a way to understand the (virtual) CPU, memory, etc. footprint of the GCP instance that finishes the WGS pipeline in ~21 hours?
  • bshifaw · Member, Broadie, Moderator admin

    Hi @jaideepjoshi ,

    What do you mean by "a way to understand" the GCP instance resources?
    The GCP instance that Cromwell creates is specified in the runtime parameters of the WDL script; this includes CPU, memory, and disk space.
    If you're interested in the machine being used to run a task, you can view the operations log produced by Cromwell to determine the machine type (e.g. n1-standard-1) used to run the workflow, and then check the Google Cloud documentation that describes that machine type.
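For reference, a hypothetical runtime block of the kind each WDL task carries (the values and image name below are illustrative, not the pipeline's actual settings):

```wdl
runtime {
  docker: "broadinstitute/gatk:latest"   # placeholder image name
  cpu: 2                                 # vCPUs requested for the task
  memory: "7 GB"
  disks: "local-disk 200 HDD"            # scratch disk size and type
  preemptible: 3                         # attempts on cheaper preemptible VMs
}
```

On the Google backend, Cromwell provisions a VM that satisfies these requests (e.g. an n1-standard-2 for 2 CPUs / 7 GB), so summing the runtime blocks across tasks gives you the resource footprint for an on-prem comparison.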

    Example of the link to the operations log in FireCloud, located in the Monitor tab of an executed workflow.

  • matdmset · Ghent · Member


    Could you include a link to Heng Li's new mapper? I'd be interested to follow up on that one.
