
Specifying Intermediate Files

Our pipeline produces large intermediate files that result in high storage cost. We added a task at the end of the pipeline to delete these intermediates, but the call cache doesn't work if we delete them. We would like an easy way to delete these intermediate files and still have the cache work.

Comments

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hello @qwessell,

    "Call caching" relies on the original output files from each intermediate task still being available, because it caches the outputs of each job/call that has been run for a workflow. It sounds like what you may be asking for is a way to cache an entire workflow?

    For my own understanding, would you mind describing the use case of wanting to be able to cache an entire workflow? This feature would enable one to re-run the same workflow with the same inputs and make a duplicate copy of the original outputs. What is the added advantage of having this?

    Thanks!
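
To illustrate why deleting intermediates defeats the cache, here is a toy model in Python. It is only a sketch of the idea described above, not Cromwell's actual implementation: `call_key` and `ToyCallCache` are hypothetical names, and the real hashing covers more than the three fields used here. A lookup is a hit only if an identical call was recorded and every recorded output file still exists.

```python
import hashlib
import os
import tempfile


def call_key(task_name, command, inputs):
    """Hash what identifies a call (assumed here: name, command, inputs)."""
    payload = repr((task_name, command, sorted(inputs.items())))
    return hashlib.sha256(payload.encode()).hexdigest()


class ToyCallCache:
    """Hypothetical per-call cache: key -> paths of that call's outputs."""

    def __init__(self):
        self._entries = {}

    def record(self, key, output_paths):
        self._entries[key] = list(output_paths)

    def lookup(self, key):
        """Return cached output paths, or None if the call must be re-run."""
        outputs = self._entries.get(key)
        if outputs is None:
            return None  # this call was never run with these inputs
        if not all(os.path.exists(path) for path in outputs):
            return None  # outputs were deleted, so there is nothing to reuse
        return outputs


if __name__ == "__main__":
    cache = ToyCallCache()
    with tempfile.TemporaryDirectory() as workdir:
        fastq = os.path.join(workdir, "sample1.fastq")
        open(fastq, "w").close()
        key = call_key("bam_to_fastq", "convert", {"bam": "sample1.bam"})
        cache.record(key, [fastq])
        print("hit before delete:", cache.lookup(key) is not None)
        os.remove(fastq)  # the cleanup task removes the intermediate
        print("hit after delete:", cache.lookup(key) is not None)
```

In this model the cleanup task turns every later lookup for `bam_to_fastq` into a miss, which matches the behavior reported in the original question.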

  • gkugener Member

    Hi @Ruchi ,

    The use case is a workflow where we are realigning BAM files. This workflow contains intermediate tasks that convert from BAM to FASTQ and from FASTQ to BAM, and a last task that quantifies the results from the BAM. The intermediate files output by the BAM-to-FASTQ step are very large (~40 GB), which swells the storage cost, so we remove them when we are done. However, when new samples are added, we would like to run the whole set of samples together again, but not rerun samples that have previously been quantified unless the WDL has changed.

    Thanks!

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hey @gkugener

    I think what you're asking for is a reasonable feature; however, the way Cromwell works today is that it checks whether the output of each task in a workflow exists in order to resume.

    Is there a place where you can split the pipeline, so that the workflow for an individual sample is its own pipeline and the steps that aggregate multiple samples are a separate pipeline? That way, you don't have to re-run the tasks for a single sample, and you only keep the intermediates that the aggregation pipeline needs.

  • gkugener Member

    OK, this is good to know. The pipeline is already set up this way: we can specify a sample set containing only the new samples for the per-sample pipeline, and then use a combined sample set to run any aggregation of files. I think this is how we will proceed for now. Thanks for the help!
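
The plan settled on above could be sketched roughly as follows, assuming you keep your own record of which WDL version each sample was last quantified with. The names `wdl_digest`, `plan_runs`, and `quantified_with` are hypothetical bookkeeping for illustration, not part of Cromwell or FireCloud.

```python
import hashlib


def wdl_digest(wdl_text):
    """Fingerprint the WDL source so a changed workflow forces a rerun."""
    return hashlib.sha256(wdl_text.encode()).hexdigest()


def plan_runs(all_samples, quantified_with, current_digest):
    """Return (samples for the per-sample pipeline, samples for aggregation).

    quantified_with maps sample id -> digest of the WDL that quantified it.
    A sample is rerun if it is new or was quantified with a different WDL.
    """
    to_run = [
        sample
        for sample in all_samples
        if quantified_with.get(sample) != current_digest
    ]
    # Aggregation always sees the combined set, old and new samples alike.
    return to_run, list(all_samples)


if __name__ == "__main__":
    digest = wdl_digest("workflow realign { ... }")
    history = {"sample_a": digest, "sample_b": digest}
    new_set, combined_set = plan_runs(
        ["sample_a", "sample_b", "sample_c"], history, digest
    )
    print("per-sample pipeline:", new_set)
    print("aggregation pipeline:", combined_set)
```

Because the decision to skip a sample rests on this record rather than on the call cache, the large per-sample intermediates can be deleted once quantification has finished.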
