Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Update: July 26, 2019
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on Github.
Deleting intermediate files

Say I want to run the GATK pipeline, but after the whole pipeline is completed, there are intermediate files from intermediate steps that I don't want anymore. Is there an easy way to find them and delete them?
Answers
Cromwell does not currently have a facility for cleaning up intermediate outputs. But as @Geraldine_VdAuwera suggested, perhaps using the final_workflow_outputs_dir workflow option would allow you to delete your entire execution directory when your workflow completes?
Does the final output directory preserve the cromwell directory structure? And does it save all the output for all the tasks, or does it just save workflow outputs?
It copies over whatever you include in the workflow outputs block in the WDL. As such the directory structure is not preserved in the outputs folder, but if for example an output is a glob, I think that gets copied over as a directory. The devs may have additional details to offer on Monday.
Globbing can indeed preserve the output's directory structure, as seen here.
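For reference, final_workflow_outputs_dir is set in a workflow options JSON file. Below is a minimal Python sketch of generating such a file. The helper name write_workflow_options and the paths are hypothetical; only the option key itself comes from this thread.

```python
import json
from pathlib import Path


def write_workflow_options(options_path, outputs_dir):
    """Write a workflow options file that sets final_workflow_outputs_dir.

    Hypothetical helper for illustration; both arguments are placeholders
    chosen by the caller.
    """
    options = {"final_workflow_outputs_dir": str(outputs_dir)}
    # Cromwell reads workflow options as a flat JSON object.
    Path(options_path).write_text(json.dumps(options, indent=2) + "\n")
    return options
```

The resulting file would then be handed to Cromwell together with the workflow (in recent Cromwell versions, via the options argument of the run command). As described above, only files declared in the workflow outputs block are copied.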
I am also interested in doing this through Cromwell. However, I would like to just extract the final output file, and not have to deal with all of the internal cromwell directory structure (such as the hash number). I used final_workflow_outputs_dir, but this seemed to simply move everything inside of cromwell-executions, including all of the directories, to the desired output directory. Is there a way to just move the final file(s) to the desired output directory?
@jfiksel There is no way to currently move just the outputs, however there is a ticket filed to flatten the directory structure. Would that help your use case?
@Ruchi I think that should do the trick. I mostly want to bypass the hash directory, so that I know what the final output path is before running the workflow. I also imagine you could add in a task to your workflow that copies the final output to a pre-specified directory outside of the cromwell-execution directory (I believe this was a solution suggested in another discussion, I forgot where I saw this though).
Ok, so I've found having the following task does this for me:
In the workflow you can then have something like
If there are better solutions (maybe the output to the copyFinalOutput task shouldn't be a File), let me know!
Hey @jfiksel, your solution seems reasonable as it works!
However, it seems like there would be issues once docker is involved and that would prevent one from porting this WDL to a cloud backend. I believe some users solve this by running a separate script outside of Cromwell to copy output files to a designated directory by grabbing the final output paths from the workflow metadata. I would think these are all viable workarounds until the feature is implemented into Cromwell itself.
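The external-script workaround mentioned above could look roughly like the following Python sketch. It assumes the workflow metadata has been saved as a JSON file with a top-level "outputs" object whose values are paths (possibly nested in arrays or maps), and that those paths are on the local filesystem. The function names and file layout are hypothetical, not part of Cromwell.

```python
import json
import shutil
from pathlib import Path


def collect_output_paths(value):
    """Recursively gather string leaves from a workflow 'outputs' value.

    Outputs may be single paths, arrays, or nested maps; non-string
    leaves (numbers, booleans, null) are skipped.
    """
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        return [p for item in value for p in collect_output_paths(item)]
    if isinstance(value, dict):
        return [p for item in value.values() for p in collect_output_paths(item)]
    return []


def copy_final_outputs(metadata_path, dest_dir):
    """Copy every existing local file named in the metadata outputs to dest_dir."""
    metadata = json.loads(Path(metadata_path).read_text())
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in collect_output_paths(metadata.get("outputs", {})):
        src = Path(path)
        # String outputs that are not files on this machine (for example
        # plain strings or cloud URLs) are ignored.
        if src.is_file():
            target = dest / src.name  # files sharing a basename overwrite each other
            shutil.copy2(src, target)
            copied.append(target)
    return copied
```

Because the script runs outside the workflow, it does not depend on what is mounted inside a task's docker container. It also flattens the directory structure, so the final path is known before the workflow runs.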
Hmm we're using wdl_runner and it does not automatically clean up intermediates. waidw?
Here's a task for those who want to do cleanup within a single workflow when run on Google cloud:
We then call this task with a list of intermediate outputs from all of the previous calls, a list of outputs from those calls we want to output, and our outputs_dir:
Note that the output section of the workflow here exports those filenames as type String instead of File, since the files will be moved by the CleanUp task before that (this is to avoid having 2 copies of those outputs, one in the outputs_dir and one in task call folders).
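The move-then-delete logic described in this post can be illustrated with a short Python sketch. This is not the CleanUp task from the post, and it works on local files rather than Google Cloud storage; the function name clean_up and its arguments are hypothetical stand-ins for the three inputs described above.

```python
import shutil
from pathlib import Path


def clean_up(intermediates, outputs, outputs_dir):
    """Move the wanted outputs into outputs_dir, then delete the intermediates.

    Returns the new locations as plain strings, mirroring the point above
    that the workflow reports String paths once the files have been moved.
    """
    dest = Path(outputs_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in outputs:
        target = dest / Path(path).name
        shutil.move(str(path), str(target))
        moved.append(str(target))
    for path in intermediates:
        # missing_ok (Python 3.8+) tolerates files that were already moved
        # above or removed earlier.
        Path(path).unlink(missing_ok=True)
    return moved
```

Moving instead of copying is what avoids the second copy of each output, and it is also why the returned values are path strings and not references to files in the original task call folders.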