
call caching speedup

Call caching is surprisingly slow, considering that cromwell only has to copy over existing results. This is because cromwell calculates the md5 hash of each output file, to make sure it hasn't changed on disk since it was created.

For me, a typical analysis generates ~500GB of data, which is a lot to read in and calculate a hash over; that is what slows call caching down.

If you run into the same problem and are confident that you never manually change cromwell's output files, here is how I solved it.

In your cromwell config file, add the following settings (they go in the `caching` stanza of your backend's `filesystems` configuration).

# Possible values: file, path
# "file" will compute an md5 hash of the file content.
# "path" will compute an md5 hash of the file path. This strategy will only be effective
# when the duplication-strategy is set to "soft-link",
# in order to allow for the original file path to be hashed.
hashing-strategy: "file"

# When true, will check if a sibling file with the same name and the .md5 extension exists,
# and if it does, use the hash from that file.
# If false or the md5 does not exist, will proceed with the above-defined hashing strategy.
check-sibling-md5: true
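The effect of `check-sibling-md5` can be pictured as the following lookup (a minimal sketch of the idea, not cromwell's actual code; `hash_for` is a hypothetical helper name):

```shell
# Sketch of the sibling-md5 lookup idea (hypothetical helper, not cromwell's code)
hash_for() {
    local f="$1"
    if [ -f "$f.md5" ]; then
        # A sibling .md5 exists: trust its contents instead of re-reading the file
        cat "$f.md5"
    else
        # Fall back to hashing the file content, as hashing-strategy "file" would
        md5sum "$f" | cut -d ' ' -f 1
    fi
}
```

When the sibling file is present, no file content is read at all, which is the whole point of the optimization.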

Next, modify the WDL tasks that generate big output files to include this code, which generates the .md5 files.

# Create md5 checksum file for call caching
md5sum "${samplename}_1.trimmed.fastq.gz" | cut -d ' ' -f 1 | tr -d '\n' > "${samplename}_1.trimmed.fastq.gz.md5"
md5sum "${samplename}_2.trimmed.fastq.gz" | cut -d ' ' -f 1 | tr -d '\n' > "${samplename}_2.trimmed.fastq.gz.md5"
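For tasks with many outputs, the same sibling files can be generated in one loop (the `*.fastq.gz` glob is an assumption about what the task produces):

```shell
# Create a sibling .md5 for every fastq.gz output of the task (glob is an assumption)
for f in *.fastq.gz; do
    [ -e "$f" ] || continue    # skip when the glob matches nothing
    md5sum "$f" | cut -d ' ' -f 1 | tr -d '\n' > "$f.md5"
done
```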

In the next version of cromwell (version 30), it should be possible to enable .md5 sibling files for all tasks by including the following in your cromwell settings file, thanks to @kshakir. See this pull request for details.

# `script-epilogue` configures a shell command to run after the execution of every command block.
# If this value is not set explicitly, the default value is `sync`. To also generate sibling
# .md5 files for every output, it can be overridden, for example with:
script-epilogue = """
sync
for file in `find .`; do
    if [ -f "$file" ]; then
        echo -n "`md5sum "$file" | cut -d ' ' -f 1`" > "$file.md5"
    fi
done
"""

This should not slow down the analysis: when the task has just completed, the output files are most likely still in the file system cache (RAM), so calculating the md5sum immediately after generating the output files is much faster than reading the files back from disk later.


  • kshakir (Broadie, Dev)

    Have you had a chance to try out script-epilogue generation of md5's? I'm curious if this works for your issue in Cromwell 30.

  • Hi @kshakir, thanks for reminding me of this question.

    To test this, I've run my trim workflow on 50 samples with cromwell version 30-16f3632 on an empty database. For each setting, I ran the analysis three times.
    Cromwell does create the .md5 files automatically, but it looks like it still calculates the file hashes on the input files. Is there a requirement for the format or the file name of the .md5 file? Right now, I'm using the convention from picard's CREATE_MD5_FILE, which is only the md5 sum, on a single line, without a newline at the end.

    $ ls 103200-001-001_1.trimmed.fastq.gz*
    103200-001-001_1.trimmed.fastq.gz  103200-001-001_1.trimmed.fastq.gz.md5
    $ cat 103200-001-001_1.trimmed.fastq.gz.md5;echo
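    A quick way to check that a sibling file follows this convention (exactly 32 lowercase hex characters, no trailing newline) is a sketch like this (`is_picard_md5` is a hypothetical name):

```shell
# Check the Picard CREATE_MD5_FILE convention: 32 hex chars, no trailing newline
is_picard_md5() {
    local f="$1"
    [ "$(wc -c < "$f")" -eq 32 ] && grep -qE '^[0-9a-f]{32}$' "$f"
}
```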

    As you can see, using the 'sibling md5' files is no faster than regular call caching. The y-axis shows the time to complete the workflow in minutes; the first run of every setting takes the longest because each run starts with a clean database, so there are no results to re-use.

  • Just to add: I also see a lot of read activity on my HD when running this. IO-wait goes to 25% and the cromwell process uses 40% CPU, which I presume is due to the md5 hashing and waiting on the hard disk.

  • ChrisL (Cambridge, MA; Member, Broadie, Dev)

    The MD5ing has to happen on the task's input files (since the task is a function of its input files and any differences will make a cache hit invalid). A sibling MD5 will help if your input file has an MD5 next to it. eg:

    File file
    call A { input: f = file}
    call B {input: f = A.out}

    If A's epilogue makes a sibling MD5 then B won't have to MD5 A.out to decide whether to call cache.

    Did you ever investigate whether the file-path-based "hashing" might help you out?
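    For reference, path-based hashing amounts to hashing the path string rather than the file content, something like this sketch (`path_hash` is a hypothetical name, not Cromwell's implementation):

```shell
# "path" hashing-strategy idea: hash the path string, never read the file content
path_hash() {
    printf '%s' "$1" | md5sum | cut -d ' ' -f 1
}
```

    This is constant-time regardless of file size, but it only detects moved or renamed files, not modified content, so it is safe only if outputs are never edited in place.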
