Cromwell I/O: Highway to the Cloud
Cromwell 26 has a new way to deal with I/O (Input/Output) operations. This new approach reduces network load and provides better control over the resources allocated throughout the system. This blogpost will describe some of the optimizations and how they improve Cromwell stability and reliability.
Time to grow up
Young and frivolous Cromwell had a pretty easy life as a workflow engine. Workflows were scarce and not that big. Now that it’s getting a bit wiser older, expectations grow as well with more workflows and larger workflows. This creates challenges in different areas, including in file handling. When running a workflow, Cromwell needs to manipulate potentially a great number of input, output and other files. Factors such as how many calls are in the workflow, how wide scatters are and using call-caching can influence the number of files Cromwell must handle. In contrast, the number of different types of operations Cromwell needs to know how to perform is small; it needs to know how to read, write, copy, delete, get the size and get the hash of a file.
Previous versions of Cromwell approach these operations in an isolated manner. Each functionality that needs to interact with a file does so without knowledge of anything else going on in the system. Thus, each operation translates to at least one HTTP request. This becomes an issue when interacting with cloud file systems that limit the number of requests per amount of time.
For those who are unfamiliar with Cromwell’s tech stack, Cromwell uses the Akka Actor framework. This framework implements something called the Actor model. If you are interested in learning more about actors and Akka’s implementation of it, you will find plenty of articles, tutorials and books on the subject. In a nutshell, it gives the ability to define self contained entities called, wait for it … actors that communicate with each other through a messaging system. This approach makes building highly concurrent systems a lot easier than having to explicitly deal with, e.g. locking and thread manipulation.
Cromwell has different types of actors that mostly represent logical concepts of the workflow itself. For instance, each workflow will have its own actor, and each job will also have its own actor(s). These actors communicate with each other to ensure the workflow runs properly.
However, previous Cromwell versions do not have actors that communicate file handling. This means that every other actor could start making as many HTTP calls as they wanted at their discretion. Such requests could quickly end up hitting quota limits and in turn produce transient errors. The overall effect is an unstable and slow system.
To solve this problem we created a new actor for the whole system that is in charge of centralizing I/O requests. Cromwell’s centralizing I/O Actor uses the Akka Stream library to process the requests coming from the rest of the system. Besides allowing some interesting optimizations, this immediately gives us two advantages. One, we can throttle, i.e. control how many operations we allow in a given amount of time. Two, we can batch requests together as one operation. Below, I explain how these two features minimize the impact of quota limitations. In addition, I explain a third feature, retry, that increases the success of jobs.
Throttling requests deals with the quota limitation that most cloud providers place on their APIs. This quota takes the form X requests per Y seconds, where X is a finite number that depends on your project. If the rate of file requests reaches the quota, any additional file requests that further increase the rate will cause an error. A first approach is to retry the request. This retry, as well as any other requests, will succeed only once the rate dips under the quota limit. If the rate is at quota, then requests recirculate and aggravate the system much like cars turning on to a road that is already jammed up with traffic. This makes the problem worse.
Cromwell’s I/O Actor limits or throttles the rate of requests to the provider quota. This removes circulating rejected requests in the system, along with the additional wall-clock time it would have taken to process these. The rate value can be set in the Cromwell configuration. See the “Config changes” section of the Cromwell 26 release notes.
As an example, the Google Cloud Storage quota for a google project can be found at:
Make sure to replace project_name with the name of your project. The line Queries per 100 seconds per user indicates your quota. Note that in this case, Cromwell is the user, which means the limitation applies simultaneously to all workflows running on the same server.
Now that we centralize the processing of file handling requests, we can take advantage of batching, a feature that some Cloud providers offer. Batching allows grouping a number of requests together as one request. The I/O Actor will automatically batch requests and will receive the response for the batched requests all at once. This is great because it allows us to decrease by a factor the size of the batch the number of HTTP requests we need to make.
In the case of Google Cloud Storage, for example, it is possible to create batches of 100 requests. It is possible to group requests of different types in the same batch. For example, we can have 50 copy requests alongside 50 delete requests in the same batch. However, some requests like reading the content of a file, or uploading a file, cannot be batched. They are processed separately.
Another challenge that comes with I/O operations is that they are not very reliable. They fail more often than we would like. The good news is that most of the times those failures are transient, and simply retrying will get the request to succeed. This is particularly important in Cromwell because if a single I/O operation like reading the return code fails, then Cromwell will fail the job and the entire workflow will fail. The probability of such failure happening becomes non-negligible with workflows containing thousands of jobs. In order to be more robust, the I/O Actor will retry failures that are known to be transient and are likely to succeed the second, or third, or fourth, or fif... alright that’s enough tries.
Not all errors are transient though, like an authentication error for example. Those won’t be retried. The generally accepted way to retry requests is using an exponential backoff strategy. The I/O Actor encapsulates this logic in a central manner, making sure that all requests are retried appropriately if they need to be.
Of course retries integrate with throttling and batching. For example, if the batch response contains some failed requests, they will be retried as well in a future batch.
That's all for today! If you have questions just leave a comment and we'll try to answer promptly. Happy WDLing!