Cromwell is slow to submit jobs in a large scatter on JES

Hi GATK team,

I'm running a WDL script on Cromwell 34 against a JES backend. The WDL is pretty simple: I am filtering a list of VCF files against a BED file of sites I want to keep. The WDL looks like this: https://gist.github.com/weinstockj/30e0d99d11e9a2633cf7602b74cbf5fe

Cromwell is very slow to submit jobs. I am using a very large input TSV of VCFs (39K files). By slow, I mean that after I submit the workflow to a beefy Cromwell VM, it takes over an hour for any JES jobs to spin up. I saw the same issue when I was running this workflow on Cromwell 31. I'm running Cromwell in server mode, and after workflow submission it shows very little CPU activity. When I run the workflow with a small number of VCF files (100 VCFs), job submission is not slow. Is there a way to restructure the WDL to avoid the slow job submission (beyond splitting the input into smaller batches)?

Thanks,
Josh

Answers

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hey Josh,

    It seems likely that Cromwell is slowly checking for potential cache hits before submitting each job. A few questions:
    1. Is call caching enabled?
    2. If yes, can you retry the workflow with “read_from_cache” set to false in your workflow options?
    3. Are all these input files living in GCS?
    4. Would you be okay with sharing your WDL? I can check if something jumps out as an obviously expensive operation for Cromwell.

    Thanks!
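
    For reference, a minimal sketch of the workflow options file that question 2 describes. "read_from_cache" is the option named above; supplying the file as the workflow options at submission time is assumed, and no other settings are shown.

    ```json
    {
      "read_from_cache": false
    }
    ```

    With this option set to false, Cromwell should skip the cache-hit lookup for the calls in that workflow, which makes it possible to tell whether call caching is what delays submission.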

  • jweinstk Member
    1. Yes, call caching is enabled.
    3. Yes, all input files live in GCS.
    4. Yes, please see the GitHub gist in my original post, which has the WDL.

    I was able to get Cromwell to submit jobs (much) faster after removing the call to the "size" function on line 45 of the WDL, which I gather is an expensive operation. My current takeaway is to avoid using "size" in a scatter call.
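
    A minimal sketch of that takeaway, not the contents of the gist. It assumes "size" was being used to derive the task's disk size, which is a common pattern but is not confirmed here. The task name, workflow name, bcftools command, docker image, and the disk_size_gb input are all hypothetical. The disk size is passed in as a plain Int, so nothing in the scatter has to look up file sizes.

    ```wdl
    version 1.0

    # Hypothetical task: the names, command, and image below are assumptions
    # made for illustration only.
    task filter_vcf {
      input {
        File vcf
        File bed
        # Fixed disk size supplied by the caller, instead of something like
        # ceil(size(vcf, "GB")) evaluated once per shard.
        Int disk_size_gb = 20
      }

      command <<<
        bcftools view -T ~{bed} -O z -o filtered.vcf.gz ~{vcf}
      >>>

      output {
        File filtered_vcf = "filtered.vcf.gz"
      }

      runtime {
        docker: "example/bcftools:latest"  # hypothetical image name
        disks: "local-disk " + disk_size_gb + " HDD"
      }
    }

    workflow filter_vcfs {
      input {
        File vcf_list        # TSV with one VCF path in the first column
        File bed
        Int disk_size_gb = 20
      }

      Array[Array[String]] rows = read_tsv(vcf_list)

      # No size() call inside the scatter; every shard uses the same disk size.
      scatter (row in rows) {
        call filter_vcf {
          input:
            vcf = row[0],
            bed = bed,
            disk_size_gb = disk_size_gb
        }
      }

      output {
        Array[File] filtered_vcfs = filter_vcf.filtered_vcf
      }
    }
    ```

    The trade-off is that a single fixed size must be large enough for the biggest VCF in the list, so smaller shards are over-provisioned.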
