Overzealous call caching

For our analysis, we use an input file listing all our samples in the format samplename /path/to/forward_reads.fastq.gz /path/to/reverse_reads.fastq.gz. I wanted to add support for spaces in filenames, which does not currently work. So I made a folder with a space in its name and hardlinked the fastq files there.

mkdir path\ with\ space
cd path\ with\ space
ln /path/to/*.fastq.gz .

I then updated the input file to include samplename /path with space/forward_reads.fastq.gz /path with space/reverse_reads.fastq.gz and ran the analysis with cromwell on this file.

To my surprise, the analysis completed successfully. However, when I look at the cromwell-execution folder, there are no inputs folders for any of the tasks, and when I look in the script file I see that the inputs refer to /path/to/forward_reads.fastq.gz instead of /path with space/forward_reads.fastq.gz.

Is this the expected behaviour? I would expect cromwell to notice the sample input file has changed, which should invalidate all cached calls that inherit from that file. How can I make sure that cromwell doesn't re-use the results from an earlier version when I change the input file?
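For reference, here is a small check of why a content-based cache key would not notice the new path. This is only a sketch: it assumes the cache key is derived from file contents rather than from paths, which I have not confirmed for cromwell. The directory and file names are placeholders, and md5sum assumes GNU coreutils.

```shell
# Sketch: a hard link points at the same data, so a checksum of the
# contents is identical for both paths. All names here are placeholders.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
mkdir original "path with space"
printf '@read1\nACGT\n+\nIIII\n' > original/forward_reads.fastq
ln original/forward_reads.fastq "path with space/forward_reads.fastq"

# Hash the contents only; reading from stdin keeps the path out of the output.
hash_original=$(md5sum < original/forward_reads.fastq | cut -d' ' -f1)
hash_linked=$(md5sum < "path with space/forward_reads.fastq" | cut -d' ' -f1)

if [ "$hash_original" = "$hash_linked" ]; then
    echo "same content hash"
else
    echo "different content hash"
fi
```

If the two hashes match, any cache keyed on file contents alone would treat the hardlinked copies as the same inputs, regardless of the path written in the sample file.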

Best Answer

Answers

  • @ChrisL
    That makes sense, thanks!

    I have a related question. I'm trying out running an analysis (with call caching) where the raw data is on a locally mounted sftp server. As far as I can tell from the network activity, cromwell only fetches the raw data from the sftp mount once. So I guess it re-uses the localised input files from the call cache instead of reading them from the sftp mount again.
    Is that the case? Wouldn't cromwell still need to read the input files from the sftp server to calculate the checksum?
