
How much memory does Cromwell need for input or output files?

mmah Member, Broadie

I am currently running Cromwell in a SLURM environment, where jobs fail if they exceed their requested memory limits. I am encountering workflow problems where my jobs are failing, and I am trying to understand why this is happening.

I have the following working:

  1. bcl2fastq on SLURM, with 2 cores, 16 GB memory
  2. Cromwell with a simple workflow on SLURM, with a backend modified from LSF

I am now trying to combine these two working items, but my jobs are failing, at least partially due to insufficient memory and timeout errors. I do not understand when input or output is written to disk or is held in memory, or which process needs this memory (Cromwell process or task process). All I know at present is that when combining bcl2fastq with Cromwell there are failures somewhere, and at least one error message from a child process shows a SLURM out of memory error.

bcl2fastq takes as input a directory, which I am passing to the WDL task as a String. The child process stderr indicates the input files can be read normally.
The output of bcl2fastq is large when run without Cromwell: tens of GB. With Cromwell, I see no data output to the execution directory.
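For reference, my task is shaped roughly like this (a sketch only; the bcl2fastq flags, directory names, and memory values here are illustrative, not my exact task):

```wdl
task bcl2fastq {
  String run_dir  # run folder passed as a String, not a File

  command {
    bcl2fastq --runfolder-dir ${run_dir} --output-dir ./fastq
  }

  output {
    Array[File] fastqs = glob("fastq/*.fastq.gz")
  }

  runtime {
    cpu: 2
    memory: "16 GB"
  }
}
```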

How much memory does Cromwell need to handle a WDL task with bcl2fastq? Is this 1x output? 2x output? Does this memory need to be allocated to the SLURM process running Cromwell itself, or to the job running bcl2fastq, or to both? For a task consuming this output, how much memory is required beyond what is needed without Cromwell?

If I use a MySQL database instead of the in-memory database, how is this memory affected? Is task output data stored in the database, or is this only job metadata?

Although my example here is bcl2fastq, I also have other custom processing that ingests and outputs similarly large data sets, so I am generally looking for guidance on memory usage for Cromwell and its child jobs scaling with input and output data size.


  • KateN Cambridge, MA Member, Broadie, Moderator

    So it looks like you know that you need 10 GB of memory for your job. Are you specifying this using the runtime attribute in your WDL script? If you are, are you sure that the memory attribute is correctly being added to the SLURM command line? (Given that you are using this modified backend, your memory specification could be getting lost here.) Have you tried testing the configuration with a smaller command, one that would normally use far less than 10 GB of memory?
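    As a sanity check, in a SLURM backend config the runtime attribute only reaches sbatch if the submit string interpolates it explicitly, something like the sketch below (attribute names and defaults are illustrative; check them against your modified LSF config):

    ```hocon
    runtime-attributes = """
    Int cpus = 1
    Int memory_mb = 2000
    """
    submit = """
        sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} \
          ${"-c " + cpus} --mem=${memory_mb} \
          --wrap "/bin/bash ${script}"
    """
    ```

    If memory_mb never appears in the submit string, the memory value in your WDL runtime block is silently ignored and SLURM falls back to its default limit.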

    And unless you are using read_string, Cromwell won't ever read the input files into memory itself, so you don't need to worry about that using up your compute.
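    To illustrate the read_string distinction (a hypothetical task, not from your workflow):

    ```wdl
    task show_difference {
      command {
        echo "some result" > out.txt
      }
      output {
        # read_string pulls the file's contents into Cromwell itself
        String contents = read_string("out.txt")
        # a File output only makes Cromwell track the path
        File result = "out.txt"
      }
    }
    ```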

  • mmah Member, Broadie

    I have sorted out the issue with data output in the execution directory (the default output directory was not the current directory), but this is still separate from the memory issue.

  • mmah Member, Broadie

    I am fairly sure the memory attributes are being added to the SLURM command line correctly: I have checked this in the execution/script.submit file. Along the way I also discovered that some runtime attribute names are not allowed. A simple WDL workflow with trivial Python scripts runs successfully for Cromwell on SLURM.

    I ask how memory usage scales with input/output size because this statement in the Cromwell readme is ambiguous about what gets stored in the database:

    Cromwell uses either an in-memory or MySQL database to track the execution of workflows and store outputs of task invocations.

    If memory usage does not scale with input/output at all, then I am probably simply below the minimum I need somewhere.

  • kshakir Broadie, Dev

    If I use a MySQL database instead of the in-memory database, how is this memory affected? Is task output data stored in the database, or is this only job metadata?

    Both output paths and job metadata are stored in the database. In general, the path to a BAM and the path to its index take the same amount of memory for Cromwell to track, both temporarily and within the database.

    NOTE: There are specific cases where one can misconfigure Cromwell so that it tries to read entire files into memory, but those settings should be turned off anyway or Cromwell will not run effectively (e.g. turning on call-caching while using a Local/HPC backend and leaving the Cromwell hashing-strategy as file instead of path while generating large files).
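    For a Local/HPC backend, that misconfiguration corresponds to a setting along these lines (a sketch; the exact key layout depends on your Cromwell version):

    ```hocon
    backend.providers.Local.config.filesystems.local.caching {
      # "file" hashes file contents, which means reading whole
      # (possibly huge) outputs into memory; "path" hashes only
      # the path string, which is what you want for large outputs
      hashing-strategy: "path"
    }
    ```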

    Besides file paths, the database also stores the original WDL, the original inputs provided, and the contents of outputs (if you read_string a large value, that large value is stored in the database) for provenance. For a large workflow, or a large number of workflows, this data can build up in an in-memory database, increasing the amount of memory required for the Cromwell process to run.

    Relatedly, an in-memory database only lasts as long as Cromwell is running and does not survive restarts. When one wants jobs to keep running across Cromwell restarts, or wants to save the results of previous runs for, say, call-caching purposes, one should use a MySQL database. In this case, Cromwell will still generate temporary data in memory, but after it is successfully stored in MySQL the Java garbage collector can reclaim the temporary values. This allows Cromwell to run many more workflows over time by storing the results in MySQL.
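    Pointing Cromwell at MySQL is a matter of the database stanza in the config, roughly like this (the URL, credentials, and driver class names are placeholders; the exact keys depend on your Cromwell version, so check the reference config that ships with your release):

    ```hocon
    database {
      driver = "slick.driver.MySQLDriver$"
      db {
        driver = "com.mysql.jdbc.Driver"
        url = "jdbc:mysql://localhost/cromwell_db"
        user = "cromwell"
        password = "..."
        connectionTimeout = 5000
      }
    }
    ```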

    NOTE: Currently if one tries to run a larger number of workflows at the same time, Cromwell will need a significant amount of memory (say -Xmx16g or more) while it keeps track of the various workflows and jobs in memory.
