
Cromwell is slow to submit jobs in a large scatter on JES

Hi GATK team,

I'm running a WDL script on cromwell 34 against a JES backend. The WDL is pretty simple - I am filtering a list of VCF files against a BED file of sites I want to keep. The WDL looks like this: https://gist.github.com/weinstockj/30e0d99d11e9a2633cf7602b74cbf5fe

Cromwell is very slow to submit jobs. I am using a very large input TSV of VCFs (39K files). By slow, I mean I have submitted jobs to a beefy Cromwell VM, and it takes over an hour for any JES jobs to spin up. I was previously running this workflow on Cromwell 31, where I experienced this issue as well. I'm running Cromwell in server mode, and after workflow submission, it displays very little CPU activity. When running this workflow with a small number of VCF files (100 VCFs), I do not experience the slow job submission. Is there a way that I can re-structure the WDL to avoid the slow job submission (beyond splitting things into smaller batches)?
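For context, the workflow is shaped roughly like the hedged sketch below (the real WDL is in the gist; task name, tool command, and docker image here are illustrative, not the actual ones): a wide scatter over the VCF array, with one filtering call per shard.

```wdl
version 1.0

workflow filter_vcfs {
  input {
    Array[File] vcfs       # ~39K VCFs read from the input TSV
    File keep_sites_bed    # BED file of sites to keep
  }

  # One shard per VCF; with 39K inputs this is a very wide scatter
  scatter (vcf in vcfs) {
    call filter { input: vcf = vcf, bed = keep_sites_bed }
  }

  output {
    Array[File] filtered = filter.filtered_vcf
  }
}

task filter {
  input {
    File vcf
    File bed
  }
  command <<<
    # Illustrative filtering step; the gist's actual command may differ
    bcftools view -R ~{bed} ~{vcf} -Oz -o filtered.vcf.gz
  >>>
  output {
    File filtered_vcf = "filtered.vcf.gz"
  }
  runtime {
    docker: "biocontainers/bcftools:v1.9-1-deb_cv1"  # illustrative image
  }
}
```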




  • Ruchi Member, Broadie, Moderator, Dev admin

    Hey Josh,

    It seems like Cromwell is probably very slowly checking for potential cache hits before submitting the job. A few questions:
    1. Is call caching enabled?
    2. If yes, can you retry the workflow with “read_from_cache” set to false in your workflow options?
    3. Are all these input files living in GCS?
    4. Would you be okay with sharing your WDL? I can check if something jumps out as an obviously expensive operation for Cromwell.
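    For question 2, disabling cache reads is done through the workflow-options JSON passed at submission. A minimal options file would look like this (`read_from_cache` and `write_to_cache` are standard Cromwell workflow options; leaving writes on means the results can still seed the cache for later runs):

    ```json
    {
      "read_from_cache": false,
      "write_to_cache": true
    }
    ```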


    1. Yes, call caching is enabled.
    3. Yes, all input files live in GCS.
    4. Yes, please see the GitHub gist in my original post, which has the WDL.

    I was able to get Cromwell to submit jobs (much) faster after removing the call to the "size" command on line 45, which I gather is an expensive operation. My current takeaway is to avoid use of the "size" command in a scatter call.
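    For reference, the expensive pattern is a per-shard size() call inside the scatter, sketched below (task and input names are illustrative). Before any shard can be submitted, Cromwell must evaluate its declarations, and on a 39K-wide scatter a GCS metadata lookup per file adds up quickly:

    ```wdl
    scatter (vcf in vcfs) {
      # Evaluated before job submission: one GCS stat per shard
      Float vcf_gb = size(vcf, "GB")

      call filter {
        input:
          vcf = vcf,
          disk_gb = ceil(vcf_gb) + 10   # e.g. sizing a per-shard disk
      }
    }
    ```

    Hoisting the size() call out of the scatter (or using a fixed disk size) avoids the per-shard lookups.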

  • erdanieee Member
    edited May 2019
    I had a similar problem. I found that changing the default hashing-strategy from "file" (md5) to "path+modtime" significantly sped up the process.

    Here is the relevant part of my configuration file:

    filesystems {
      local {
        localization: [ "soft-link", "hard-link", "copy" ]
        caching {
          duplication-strategy: [ "soft-link", "hard-link", "copy" ]
          hashing-strategy: "path+modtime"
          check-sibling-md5: true
        }
      }
    }