Update: July 26, 2019
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on GitHub.

Cromwell is slow to submit jobs in large scatter on JES

Hi GATK team,

I'm running a WDL script on Cromwell 34 against a JES backend. The WDL is pretty simple: I'm filtering a list of VCF files against a BED file of sites I want to keep. The WDL looks like this: https://gist.github.com/weinstockj/30e0d99d11e9a2633cf7602b74cbf5fe

Cromwell is very slow to submit jobs. I am using a very large input TSV of VCFs (39K files). By slow, I mean I have submitted jobs to a beefy Cromwell VM, and it takes over an hour for any JES jobs to spin up. I previously ran this workflow on Cromwell 31, where I experienced the same issue. I'm running Cromwell in server mode, and after workflow submission it shows very little CPU activity. When running this workflow with a small number of VCF files (100 VCFs), I do not experience the slow job submission. Is there a way I can restructure the WDL to avoid the slow job submission (beyond splitting things into smaller batches)?

Thanks,
Josh

Answers

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hey Josh,

    It seems likely that Cromwell is spending that hour slowly checking for potential cache hits before submitting the jobs. A few questions:
    1. Is call caching enabled?
    2. If yes, can you retry the workflow with “read_from_cache” set to false in your workflow options?
    3. Are all these input files living in GCS?
    4. Would you be okay with sharing your WDL? I can check if something jumps out as an obviously expensive operation for Cromwell.

    Thanks!
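
    For reference, workflow options are passed to Cromwell as a separate JSON file at submission time, so a minimal options file for point 2 might look like this (a sketch, not from the original post):

    ```
    {
      "read_from_cache": false
    }
    ```

    Submitting the workflow with this options file skips the cache lookup for that run only, without disabling call caching for the whole server.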

    1. Yes, call caching is enabled.
    3. Yes, all input files live in GCS.
    4. Yes, please see the GitHub gist in my original post, which has the WDL.

    I was able to get Cromwell to submit jobs (much) faster after removing the call to the "size" function on line 45, which I gather is an expensive operation. My current takeaway is to avoid using "size" inside a scatter.
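
    To illustrate the takeaway, a hypothetical restructuring might look like the sketch below (the task and input names here are invented, not from the gist): pass a fixed disk size into the scatter instead of calling size() per shard, since each size() call on a GCS file triggers a metadata lookup, which adds up quickly across 39K shards.

    ```
    version 1.0

    workflow filter_vcfs {
      input {
        Array[File] vcfs
        File bed
      }

      scatter (vcf in vcfs) {
        call filter_vcf {
          input:
            vcf = vcf,
            bed = bed,
            # Fixed disk size instead of size(vcf, "GB") inside the
            # scatter: avoids one GCS metadata lookup per shard.
            disk_gb = 20
        }
      }
    }

    task filter_vcf {
      input {
        File vcf
        File bed
        Int disk_gb
      }
      command <<<
        bcftools view -R ~{bed} ~{vcf} -Oz -o filtered.vcf.gz
      >>>
      runtime {
        disks: "local-disk ~{disk_gb} HDD"
      }
      output {
        File filtered = "filtered.vcf.gz"
      }
    }
    ```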

  • erdanieee Member
    edited May 29
    I had a similar problem. I found that changing the default hashing-strategy from "file" (md5) to "path+modtime" significantly sped up the process.

    Here is the relevant part of my configuration file:

    ```
    filesystems {
      local {
        localization: [ "soft-link", "hard-link", "copy" ]
        caching {
          duplication-strategy: [ "soft-link", "hard-link", "copy" ]
          hashing-strategy: "path+modtime"
          check-sibling-md5: true
        }
      }
    }
    ```
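
    For context on where this fragment lives (an assumption about placement, not stated above): in a full Cromwell configuration, the filesystems block sits under the backend provider's config section, e.g.:

    ```
    backend {
      providers {
        Local {
          config {
            # filesystems { ... } block from above goes here
          }
        }
      }
    }
    ```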