Update: July 26, 2019
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on GitHub.

Cromwell is slow to submit jobs in large scatter on JES

Hi GATK team,

I'm running a WDL script on cromwell 34 against a JES backend. The WDL is pretty simple - I am filtering a list of VCF files against a BED file of sites I want to keep. The WDL looks like this: https://gist.github.com/weinstockj/30e0d99d11e9a2633cf7602b74cbf5fe

Cromwell is very slow to submit jobs. I am using a very large input TSV of VCFs (39K files). By slow, I mean I have submitted jobs to a beefy Cromwell VM, and it takes over an hour for any JES jobs to spin up. I was previously running this workflow on Cromwell 31, where I experienced this issue as well. I'm running Cromwell in server mode, and after workflow submission it shows very little CPU activity. When running this workflow with a small number of VCF files (100 VCFs), I do not experience the slow job submission. Is there a way I can restructure the WDL to avoid the slow job submission (beyond splitting things into smaller batches)?

  • Ruchi (admin, Broadie, Moderator, Dev)

    Hey Josh,

    It seems likely that Cromwell is slowly checking for potential cache hits before submitting the jobs. A few questions:
    1. Is call caching enabled?
    2. If yes, can you retry the workflow with “read_from_cache” set to false in your workflow options?
    3. Are all these input files living in GCS?
    4. Would you be okay with sharing your WDL? I can check if something jumps out as an obviously expensive operation for Cromwell.
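
    For reference, the "read_from_cache" flag mentioned in question 2 goes in a workflow options JSON file passed alongside the WDL at submission time; a minimal sketch:

    {
      "read_from_cache": false
    }

    This keeps call caching enabled for writing new cache entries while skipping the cache-hit lookup that can stall large scatters.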


  • jweinstk Member
    1. Yes, call caching is enabled.
    3. Yes, all input files live in GCS.
    4. Yes, please see the GitHub gist in my original post, which has the WDL.

    I was able to get Cromwell to submit jobs (much) faster after removing the call to the "size" command on line 45, which I gather is an expensive operation. My current takeaway is to avoid use of the "size" command in a scatter call.
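
    As a sketch of that takeaway (the task and variable names here are hypothetical, since they stand in for the gist): instead of calling size() on each shard's input inside the scatter, which forces Cromwell to stat every file up front, pass a precomputed disk size in as a plain Int parameter:

    version 1.0

    task filter_vcf {
      input {
        File vcf
        File bed
        Int disk_gb   # supplied once as a workflow input, instead of ceil(size(vcf, "GB")) per shard
      }
      command {
        # filtering command is illustrative; substitute the actual tool invocation
        bcftools view -R ~{bed} ~{vcf} -Oz -o filtered.vcf.gz
      }
      runtime {
        disks: "local-disk ~{disk_gb} HDD"
      }
      output {
        File filtered = "filtered.vcf.gz"
      }
    }

    The trade-off is a fixed disk allocation for every shard rather than one sized to each input, which is usually acceptable when the inputs are similar in size.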

  • erdanieee Member
    edited May 29
    I had a similar problem. I found that changing the default hashing-strategy from "file" (an md5 of the contents) to "path+modtime" significantly sped up the process.

    Here is the relevant part of my configuration file:

    filesystems {
      local {
        localization: [ "soft-link", "hard-link", "copy" ]
        caching {
          duplication-strategy: [ "soft-link", "hard-link", "copy" ]
          hashing-strategy: "path+modtime"
          check-sibling-md5: true
        }
      }
    }