Update: July 26, 2019
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on Github.
Best practices for large scatter operations

I'm using Cromwell 28.2 on Ubuntu 14.04, with Google JES as the backend, to run a scatter operation over a large array of files. When scattering over 500 files, Cromwell uses about 5-6 GB of RAM; when scattering over ~3,000 files, it uses around 30 GB. This suggests to me that Cromwell's RAM usage scales linearly with the number of instances.
I'd like to use Cromwell for much larger scatter operations, say 60,000 files. Extrapolating linearly from the above (roughly 10 MB per shard), that would take around 600 GB of RAM.
1. Are there recommendations for reducing Cromwell's RAM usage for very large jobs?
   1a. If not, is splitting up a large job into smaller batches of jobs the only option?
2. Is the above RAM usage in line with expectations?
Answers
Hi @jweinstk,
Talking to a couple of others: for the size/scope of the workflow you're trying to run, perhaps you can try to reduce the number of File objects Cromwell tracks by using a file-of-filenames (FoFN). Basically, store the list of files you want to access in a single file, and then use methods like read_lines to build an array in WDL. Google JES will also have problems with tens of thousands of paths to localize, so folks needing to do this often build gsutil into their docker image, and then localize the thousands of files inside their WDL task's command with custom calls to gsutil cp ….
Regarding the RAM expectations, it really depends on the shape of the tasks/workflow. But I will say we've seen a couple of folks allocate tens of GB of RAM (say 20 GB) to successfully track tens of thousands of workflow calls (say 100k calls).
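As a minimal sketch of that pattern (the task, workflow, and input names here are hypothetical, and it assumes an image with gsutil installed plus credentials that let the VM read the bucket):

```wdl
task process_one {
  String gcs_path   # a gs:// URI read from the FoFN; localized manually below

  command {
    # Copy just this shard's file instead of having Cromwell localize it
    gsutil cp ${gcs_path} input_file
    # ... run the real per-file work on input_file here; wc is just a placeholder ...
    wc -l input_file > result.txt
  }
  output {
    File result = "result.txt"
  }
  runtime {
    docker: "google/cloud-sdk:slim"   # any image that includes gsutil
  }
}

workflow large_scatter {
  File fofn                              # one gs:// path per line
  Array[String] paths = read_lines(fofn)

  scatter (p in paths) {
    call process_one { input: gcs_path = p }
  }

  output {
    process_one.result
  }
}
```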
@kshakir, thanks for the reply! I am already reading in the file names using read_lines on a text file that lists all of the file paths. My scatter tasks don't need every single file available to every docker VM; the process is more to run a few tasks on each file separately, and then a few tasks to merge and process the results.
Some follow-up questions:
1. Are there different settings for the MySQL server that I can change that might reduce the Cromwell RAM usage?
2. Or perhaps different configurations for changing the way Cromwell communicates with the DB that might help?
3. Is any of the code for these large jobs (that you have seen working with tens-of-thousands of workflow calls) publicly available?
I'm not aware of tweaks for the db that relate to reducing memory usage. Most of the tweaks I can think of involve increasing internal queue and batch sizes, where more data is held in memory. The goal of those bigger sql batches is to decrease the latency due to db client/server round trips.
I'm also not aware of any published WDLs on the scale you're mentioning. Even with the WDL in hand, those authors have heavily tuned their compute environments, running Cromwell on GCE nodes with many gigabytes of memory and using hefty Cloud SQL instances to hold their Cromwell databases.
While it would be great to have a single workflow that does everything, depending on what you're trying to accomplish it may be more efficient to run the jobs as 60K individual workflows (with call-caching turned on to help with hiccups). Then, when those are done, you could get the list of outputs from Cromwell and run a single workflow at the end to merge the results, as sketched below.
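If you go that route, the final merge step could look roughly like this (names are hypothetical; it assumes you've gathered the 60K output paths from Cromwell into a FoFN and reuses the gsutil trick above so this single call doesn't have to track tens of thousands of File inputs):

```wdl
task merge_results {
  File outputs_fofn     # one gs:// output path per line, gathered from the 60K workflows

  command {
    mkdir results
    # Localize the outputs with gsutil (-I reads the list of URIs from stdin)
    cat ${outputs_fofn} | gsutil -m cp -I results/
    # ... merge/process everything under results/ here; concatenation is a placeholder ...
    find results -type f -print0 | xargs -0 cat > merged.txt
  }
  output {
    File merged = "merged.txt"
  }
  runtime {
    docker: "google/cloud-sdk:slim"   # any image that includes gsutil
  }
}

workflow merge_all {
  File outputs_fofn
  call merge_results { input: outputs_fofn = outputs_fofn }
}
```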
Another option here would be to pass a FoFN to every shard VM along with its index, and ask it to select the line at that index and localize only that file manually. A quick Google search turns up ways of selecting the nth line from a file (e.g. with sed -n). So, e.g., something like:
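A rough sketch of that shard task (task and input names are hypothetical; it assumes a 1-based index and an image with gsutil):

```wdl
task process_shard {
  File fofn      # the full list of gs:// paths, one per line (small, cheap to localize)
  Int index      # this shard's 1-based line number in the FoFN

  command {
    # Select just this shard's path from the FoFN, then localize only that file
    gs_path=$(sed -n "${index}p" ${fofn})
    gsutil cp "$gs_path" input_file
    # ... run the per-file work on input_file here; wc is just a placeholder ...
    wc -l input_file > result.txt
  }
  output {
    File result = "result.txt"
  }
  runtime {
    docker: "google/cloud-sdk:slim"   # any image that includes gsutil
  }
}
```

You would then scatter over an array of indices (e.g. read from another small file) rather than over the file paths themselves.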
That way, Cromwell doesn't need to store every one of 60,000 file names (which are often very long) in memory, probably several times, for the lifetime of the workflow. Hopefully that will reduce your memory usage to a more manageable level.