How much does it cost to use FireCloud?
First, some good news: FireCloud itself won't cost you anything. We make the platform available to everyone for free in the hope that doing so will enable scientists to get work done. It's that simple.
Now the caveat: you will need to pay Google for use of their services, i.e. for compute, storage and egress (download). Some of that is relatively predictable, especially storage and egress, if you have a reasonable rule of thumb to estimate the size of the outputs that you will generate in the course of your work. In contrast, it can be harder to predict how much the compute will cost you, unless you have access to historical data for the workflows you intend to run.
We understand that dealing with all these unknowns can be stressful, so here's a detailed breakdown of what costs money in the Google Cloud Platform (GCP) and how the FireCloud development team is working to provide proactive, transparent billing information.
"Egress" refers to any activity that moves data out of Google Storage, such as downloading data to your local machine or to another cloud, viewing the contents of output files, or using IGV in the Analysis page of a workspace.
Whenever you explicitly choose to download files through FireCloud, e.g. by clicking on filepaths in tables in the Data page or the Monitor page of a workspace, you will be reminded that this has a cost, and the preview/download dialog will provide you with an estimated cost for the download. The cost is entirely determined by the size of the file. For example, downloading a log file as shown below typically costs less than $0.01, but downloading genome sequence data files can make a more noticeable dent in your wallet.
Note that if you choose to download files through the Google Cloud console or through the gsutil command-line utility, you will not be provided with this estimate.
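If you download outside of FireCloud and want a rough figure anyway, the calculation is simple because the cost depends only on file size. The Python sketch below is an illustration, not FireCloud's actual estimator: the function name is made up for this example, and the default rate of $0.12 per GB is an assumed placeholder that you should replace with the current egress rate from Google's pricing page.

```python
def estimate_egress_cost(file_size_bytes: int, price_per_gb: float = 0.12) -> float:
    """Return a rough download (egress) cost in USD for one file.

    price_per_gb is an ASSUMED placeholder rate, not an official price.
    """
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    size_gb = file_size_bytes / 1024**3  # bytes -> GB (binary)
    return size_gb * price_per_gb


# A small log file versus a large sequence data file (sizes are made up).
log_cost = estimate_egress_cost(2 * 1024**2)    # 2 MB log file
bam_cost = estimate_egress_cost(50 * 1024**3)   # 50 GB sequence file
print(f"log file: ${log_cost:.4f}")
print(f"sequence file: ${bam_cost:.2f}")
```

With the placeholder rate, the log file comes out well under a cent while the sequence file costs several dollars, which matches the contrast described above.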
Something you may not realize is that using IGV (or indeed any other genome browser that can access cloud-stored data) to view your data has a cost, since FireCloud (if you're using the embedded IGV feature in the Analysis page) or the genome browser (if you're using it standalone) must retrieve the data you view out of storage, which is equivalent to downloading it as far as Google is concerned. Fortunately this typically only costs small amounts per search because the system only accesses small portions of the data -- unless of course you're scrolling through an entire genome on high zoom, which frankly seems like a questionable life decision (but we're not here to judge you).
In any case, all these costs get passed on to the billing project that is associated with the "bucket" where the data lives.
Addendum: It is completely free to upload data to the Google Cloud platform buckets, including FireCloud workspace buckets (which are just regular buckets as far as Google is concerned). It's also free to move data between Google buckets or from a Google bucket to a Google virtual machine in the same region.
Storing files on the cloud costs money, because somewhere, there's a hard drive taking up space and sucking down electricity for the sole purpose of keeping the data there. The bigger the file, you guessed it, the more it costs to store.
FireCloud provides storage cost estimates for the buckets that are associated with workspaces. Specifically, you can view the estimated monthly cost of storage for all the data that lives in a workspace's bucket, right below the Google bucket ID in the Summary page of the workspace.
In this screenshot, $3.03 is the estimated storage fee per month to host the files in this workspace's bucket. Google will charge this amount to the FireCloud billing project that was used to set up the workspace (broad-dsde-firecloud-billing).
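To get a feel for how such a monthly estimate scales, here is a minimal Python sketch that multiplies the total size of the files in a bucket by a per-GB monthly rate. It is not the calculation FireCloud itself performs. The function name and the default rate of $0.026 per GB per month are assumptions chosen for illustration, so check Google's current storage pricing before relying on the numbers.

```python
def estimate_monthly_storage_cost(file_sizes_bytes, price_per_gb_month=0.026):
    """Return a rough monthly storage cost in USD for a collection of files.

    price_per_gb_month is an ASSUMED placeholder rate, not an official price.
    """
    total_gb = sum(file_sizes_bytes) / 1024**3  # bytes -> GB (binary)
    return total_gb * price_per_gb_month


# Hypothetical bucket contents: two 40 GB files and one 20 GB file.
bucket_files = [40 * 1024**3, 40 * 1024**3, 20 * 1024**3]
print(f"${estimate_monthly_storage_cost(bucket_files):.2f} per month")
```

Because the cost is linear in total size, deleting intermediate files you no longer need reduces the monthly fee proportionally.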
If you're already cloud-savvy you might be aware that there are different levels of storage, with different costs, that correspond to how fast you can access your data and where it is geographically stored. FireCloud uses standard storage in multi-regional buckets, which is the most convenient but not the cheapest option. If you would like to minimize your storage costs, you can move your data to cheaper storage through the Google console -- but note that FireCloud does not have any way to automatically update the links it records for your files, so moving the data around will break those links.
Whenever you run an analysis that gets executed on Google Compute Engine, you will have to pay for the operating costs of the computational infrastructure -- i.e. "the compute" for short.
If you're new to the field and/or this kind of infrastructure, this is the really stressful part, because on the face of it there is no obvious way to predict what any given workflow will cost you -- and there is typically a lag of 6 to 48 hours between the time when the cloud resource is running and the moment when that activity gets billed to you.
Google will bill your usage per minute, at a rate that depends on:
- the size of the virtual machine
  - amount of RAM
  - number of CPUs
- type (HDD or SSD) and size of persistent disk
All of these are parameters that can be set in the runtime section of each task within the WDL of a method. If they are not set, Google will use some basic default values when determining what type of virtual machines to provision for your workflow tasks.
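The factors listed above can be combined into a back-of-the-envelope estimate for a single task. The Python sketch below is only an illustration: every rate in `RATES` is a made-up placeholder rather than a Google price, the `TaskResources` class and function name are hypothetical, and the model ignores details such as machine-type rounding or discounts.

```python
import math
from dataclasses import dataclass

# ASSUMED placeholder rates in USD; substitute values from Google's pricing page.
RATES = {
    "cpu_per_hour": 0.03,
    "ram_gb_per_hour": 0.004,
    "disk_gb_per_month": {"HDD": 0.04, "SSD": 0.17},
}
HOURS_PER_MONTH = 730  # average number of hours in a month


@dataclass
class TaskResources:
    cpus: int
    ram_gb: float
    disk_gb: float
    disk_type: str = "HDD"  # "HDD" or "SSD"


def estimate_task_cost(res: TaskResources, runtime_seconds: float) -> float:
    """Return a rough compute cost in USD for one task, billed per minute."""
    billed_minutes = math.ceil(runtime_seconds / 60)  # round up to whole minutes
    hours = billed_minutes / 60
    disk_hourly = (
        res.disk_gb * RATES["disk_gb_per_month"][res.disk_type] / HOURS_PER_MONTH
    )
    hourly = (
        res.cpus * RATES["cpu_per_hour"]
        + res.ram_gb * RATES["ram_gb_per_hour"]
        + disk_hourly
    )
    return hourly * hours


small = TaskResources(cpus=1, ram_gb=4, disk_gb=50)
large = TaskResources(cpus=8, ram_gb=32, disk_gb=500, disk_type="SSD")
print(f"small machine, 2 h: ${estimate_task_cost(small, 2 * 3600):.2f}")
print(f"large machine, 2 h: ${estimate_task_cost(large, 2 * 3600):.2f}")
```

Running the same two-hour task on the larger machine costs several times more under these placeholder rates, which is why oversized resource requests add up quickly.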
Naturally, the key to minimizing your costs is to request the right resources for each task -- enough for the process to run efficiently but not so much that resources get wasted. For example, there's no point requesting a machine with multiple CPUs to run a program that cannot use more than one at a time.
That being said, it's typically not trivial to determine what the right resources are for a given task, especially if there's going to be any substantial variation in the size and shape of the data that you're going to push through it. We are currently working on defining and documenting some WDL scripting techniques that make it possible to "autosize" the resource requests based on the data inputs.
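The idea behind autosizing can be sketched in a few lines: derive the disk request from the total size of the inputs, leave room for the outputs, and add a fixed safety margin. The Python below illustrates that logic only and is not the technique the FireCloud team is documenting. The function names, the output multiplier and the padding are all assumptions. In WDL itself, the input size would typically come from the built-in `size()` function, and the result would be passed to the `disks` runtime attribute.

```python
import math


def autosize_disk_gb(input_sizes_bytes, output_multiplier=2.0, padding_gb=10):
    """Return a disk size in whole GB scaled to the inputs.

    output_multiplier: ASSUMED ratio of output size to input size.
    padding_gb: ASSUMED fixed safety margin for temporary files.
    """
    input_gb = sum(input_sizes_bytes) / 1024**3
    # Room for the inputs themselves plus the expected outputs, rounded up.
    return math.ceil(input_gb * (1 + output_multiplier)) + padding_gb


def disks_runtime_value(disk_gb: int, disk_type: str = "HDD") -> str:
    """Format the size as a 'local-disk <GB> <type>' string for the runtime section."""
    return f"local-disk {disk_gb} {disk_type}"


inputs = [30 * 1024**3]  # one hypothetical 30 GB input file
print(disks_runtime_value(autosize_disk_gb(inputs)))
```

A task sized this way requests a small disk for small inputs and a larger one only when the data calls for it, instead of hard-coding a worst-case value.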
We're also working on two approaches to make compute costs more predictable. One is to encourage workflow developers to include estimated costs in the documentation of the methods they publish. The other is to integrate cost reporting per workflow, per submission, per submitter and per workspace into the FireCloud billing management features.