Call caching speedup

Call caching is surprisingly slow, considering that Cromwell only has to copy over existing results. This is because Cromwell calculates an md5 hash of each output file to make sure it hasn't changed on disk since it was created.

For me, a typical analysis generates ~500 GB of data; reading all of that back in to calculate the hashes is what slows call caching down.

If you run into the same problem and are confident that you never manually change Cromwell's output files, here is how I solved the issue.

In your Cromwell config file, add the following settings:

# Possible values: file, path
# "file" will compute an md5 hash of the file content.
# "path" will compute an md5 hash of the file path. This strategy will only be effective
# in order to allow for the original file path to be hashed.
hashing-strategy: "file"

# When true, will check if a sibling file with the same name and the .md5 extension exists,
# and if it does, will use the content of this file as a hash.
# If false or the md5 does not exist, will proceed with the above-defined hashing strategy.
check-sibling-md5: true
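
For context, these two keys live in the per-backend filesystem caching block of the Cromwell configuration. A minimal sketch of the surrounding structure, assuming the default Local backend (adjust the provider name to match your setup):

backend {
  providers {
    Local {
      config {
        filesystems {
          local {
            caching {
              hashing-strategy: "file"
              check-sibling-md5: true
            }
          }
        }
      }
    }
  }
}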

Next, modify the WDL tasks that generate big output files to include the following code, which generates the .md5 files.

# Create md5 checksum file for call caching
md5sum "${samplename}_1.trimmed.fastq.gz" | cut -d ' ' -f 1 | tr -d '\n' > "${samplename}_1.trimmed.fastq.gz.md5"
md5sum "${samplename}_2.trimmed.fastq.gz" | cut -d ' ' -f 1 | tr -d '\n' > "${samplename}_2.trimmed.fastq.gz.md5"

In the next version of Cromwell (version 30), it should be possible to enable .md5 sibling files for all tasks by including the following in your Cromwell settings file, thanks to @kshakir. See this pull request for details.

# `script-epilogue` configures a shell command to run after the execution of every command block.
#
# If this value is not set explicitly, the default is simply `sync`. The override below also
# writes a sibling .md5 file for every regular file in the execution directory before syncing:
script-epilogue = """
find . -type f ! -name '*.md5' -print | while read -r file; do
    md5sum "$file" | cut -d ' ' -f 1 | tr -d '\n' > "$file.md5"
done
sync
"""

This should not slow down the analysis: when the task has just completed, the output files are most likely still in the file system cache (RAM), so calculating the md5sum immediately after generating the output files is much faster than reading the files back from disk later.

Answers

  • kshakir (Broadie, Dev)

    Have you had a chance to try out the script-epilogue generation of md5s? I'm curious whether this works for your issue in Cromwell 30.

  • Hi @kshakir, thanks for reminding me of this question.

    To test this, I've run my trim workflow on 50 samples with Cromwell version 30-16f3632 on an empty database. For each setting, I ran the analysis three times.
    Cromwell does create the .md5 files automatically, but it looks like it still calculates the file hashes on the input files. Is there a requirement for the format or the file name of the .md5 file? Right now I'm using the convention from Picard's CREATE_MD5_FILE: only the md5 sum, on a single line, without a newline at the end.

    $ ls 103200-001-001_1.trimmed.fastq.gz*
    103200-001-001_1.trimmed.fastq.gz  103200-001-001_1.trimmed.fastq.gz.md5
    
    $ cat 103200-001-001_1.trimmed.fastq.gz.md5;echo
    454a1d1690a5fea3789507d0504d448a
    


    As the timings show (the y-axis is the time to complete the workflow, in minutes), using the 'sibling md5' files is no faster than regular call caching. The first run of every setting takes the longest because each run starts with a clean database, so there are no results to reuse.

  • Just to add: I also see a lot of read activity on my hard disk when running this. IO-wait goes to 25% and the Cromwell process uses 40% CPU, which I presume is due to the md5 hashing and waiting on the disk.

  • Thib (Cambridge, Broadie, Dev)

    This also means that even with an epilogue that creates the sibling MD5 files, it won't have any effect on the workflow input files.
    So if you run this twice:

    workflow my_workflow {
       File big_input
       call my_task { input: f = big_input }
    }
    

    Even with your epilogue script and call caching on, big_input will be md5'ed by Cromwell every time, even though it will call cache the second time.
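
    One possible workaround, untested and assuming check-sibling-md5 is also consulted for input files, would be to pre-compute the sibling .md5 files next to the inputs before submitting the workflow (the input path is illustrative):

    # Pre-compute Picard-style sibling .md5 files for the workflow inputs
    for f in /data/inputs/*.fastq.gz; do
        md5sum "$f" | cut -d ' ' -f 1 | tr -d '\n' > "$f.md5"
    done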

  • @ChrisL
    Thanks for the suggestion. For now I've switched to using "soft-link" to localise files, in combination with file path hashing, which is a lot faster.
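
    For reference, a sketch of that configuration in the backend's filesystem block (key names as in Cromwell's reference.conf; adjust for your backend):

    filesystems {
      local {
        localization: [ "soft-link", "copy" ]
        caching {
          duplication-strategy: [ "soft-link" ]
          hashing-strategy: "path"
          check-sibling-md5: false
        }
      }
    }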
