Best practices for large scatter operations

I'm using Cromwell 28.2 on Ubuntu 14.04, with Google JES as the backend, to run a scatter operation over a large array of files. When scattering over 500 files, Cromwell takes up about 5-6 GB of RAM; when scattering over ~3,000 files, it takes up around 30 GB. This suggests to me that Cromwell's RAM usage scales linearly with the number of scatter instances.

I'd like to use Cromwell for much larger scatter operations - say, 60,000 files. The numbers above suggest that this would require roughly 600 GB of RAM.

  1. Are there recommendations for reducing Cromwell's RAM usage for very large jobs?
    1a. If not, is splitting up a large job into smaller batches of jobs the only option?
  2. Is the above RAM usage in line with expectations?

Answers

  • kshakir Broadie, Dev ✭✭

    Hi @jweinstk,

    Talking to a couple of others: for the size/scope of a workflow that you're trying to run, perhaps you can try to reduce the number of Files cromwell tracks by using a file of file names (FOFN). Basically, store the list of files you want to access in a single file, then use methods like read_lines to build an array in WDL. Google JES will also have problems with tens of thousands of paths to localize, so folks needing to do this often build gsutil into their docker image and then localize the thousands of files inside their wdl task's command with custom calls to gsutil cp ….
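
    A minimal sketch of that pattern, assuming the FOFN holds one gs:// path per line and that gsutil is baked into the task's docker image (the image name and do_work.sh script below are placeholders):

    task per_file_task {
      String gs_path   # a plain String, so JES does not try to localize it itself
      command {
        # gsutil (assumed to be installed in the image) does the localization instead
        gsutil cp ${gs_path} ./input.local
        ./do_work.sh --in=input.local > result.txt
      }
      output {
        File result = "result.txt"
      }
      runtime {
        docker: "my-org/my-tools:latest"
      }
    }

    workflow fofn_scatter {
      File fofn
      Array[String] gs_paths = read_lines(fofn)
      scatter (p in gs_paths) {
        call per_file_task { input: gs_path = p }
      }
    }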

    Regarding the RAM expectations, it really depends on the shape of the tasks/workflow. That said, we've seen a couple of folks allocate tens of GB of RAM (say 20 GB) and successfully track tens of thousands of workflow calls (say 100k calls).
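
    If that allocation means the Cromwell JVM heap, it's just a standard -Xmx flag on the machine running Cromwell; assuming server mode and a 20 GB heap (adjust the jar name to your release), something like:

    java -Xmx20g -jar cromwell-28.jar server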

  • jweinstk Member

    @kshakir, thanks for the reply! I am already reading in the file names using read_lines on a text file containing all of the file paths. My scatter tasks don't need every single file available to every docker VM - the process is more a matter of running a few tasks on each file separately, and then running a few tasks to merge and process the results.

    Some follow up questions:
    1. Are there different settings for the MySQL server that I can change that might reduce the Cromwell RAM usage?
    2. Or perhaps different configurations for changing the way Cromwell communicates with the DB that might help?
    3. Is any of the code for these large jobs (that you have seen working with tens-of-thousands of workflow calls) publicly available?

  • kshakir Broadie, Dev ✭✭
    1. Are there different settings for the MySQL server that I can change that might reduce the Cromwell RAM usage?
    2. Or perhaps different configurations for changing the way Cromwell communicates with the DB that might help?

    I'm not aware of db tweaks that would reduce memory usage. Most of the tweaks I can think of involve increasing internal queue and batch sizes, which means more data is held in memory; the goal of those bigger SQL batches is to reduce the latency from db client/server round trips.

    3. Is any of the code for these large jobs (that you have seen working with tens-of-thousands of workflow calls) publicly available?

    I'm also not aware of any published WDLs on the scale you're mentioning. Even with the WDL in hand, those authors have heavily tweaked their compute environments, running Cromwell on GCE nodes with gigs of memory and hefty Cloud SQL instances to hold their cromwell databases.

    the process is more to run a few tasks on each file separately

    While it would be great to have a single workflow that does everything, depending on what you're trying to accomplish it may be more efficient to run the jobs as 60K individual workflows (with call caching turned on to help with hiccups). Then, when those are done, you could get the list of outputs from Cromwell and run a single workflow at the end to merge the results.
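
    A minimal sketch of that final merge step, assuming the per-workflow output paths have been collected into a FOFN of gs:// URLs and that gsutil is available in the image (merge_results.sh and the image name are placeholders):

    task merge_results {
      File results_fofn
      command {
        mkdir -p results
        # gsutil cp -I reads the list of source URLs from stdin
        cat ${results_fofn} | gsutil -m cp -I ./results/
        ./merge_results.sh results/* > merged.txt
      }
      output {
        File merged = "merged.txt"
      }
      runtime {
        docker: "my-org/my-tools:latest"
      }
    }

    workflow merge_all {
      File results_fofn
      call merge_results { input: results_fofn = results_fofn }
    }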

  • ChrisL Cambridge, MA Member, Broadie, Moderator, Dev ✭✭
    edited August 2017
    Hey @jweinstk -

    Another option here would be to pass the FoFN to every shard VM along with its index, and have the task select the line at that index and localize only that file manually. A quick Google search suggests
    sed 'Nq;d' file
    for selecting the Nth line from a file. So, e.g., something like:

    task use_one_file {
      File fofn
      Int index
      command {
        FILE=$(sed '${index + 1}q;d' ${fofn}) # index + 1 because sed is 1-indexed, not 0-indexed
        gsutil cp "$FILE" ./file.local # or whatever...
        ./my_command.sh --in=file.local
      }
    }

    workflow lots_of_files {
      File fofn
      scatter(i in range(60000)) {
        call use_one_file { input: fofn = fofn, index = i }
      }
    }


    That way, Cromwell doesn't need to hold every one of the 60,000 file names (which are often very long) in memory, probably several times over, for the lifetime of the workflow. Hopefully that brings your memory usage down to a more manageable level.