
Deleting intermediate files

Say I want to run the GATK pipeline, but after the whole pipeline is completed, there are intermediate files from intermediate steps that I don't want anymore. Is there an easy way to find them and delete them?

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie; admin)
    It is possible to identify the final outputs that you care about in your WDL and have them copied to a special location, at which point you can simply delete everything in the execution directories if you'd like. It's not done by default by Cromwell, but the wdl_runner used to control Cromwell in [this setup](https://cloud.google.com/genomics/v1alpha2/gatk) does it. I'll ask the Cromwell team if they have instructions for making that happen through Cromwell itself (assuming that's possible).
  • mcovarr (Cambridge, MA; Member, Broadie, Dev)

    Cromwell does not currently have a facility for cleaning up intermediate outputs. But, as @Geraldine_VdAuwera suggested, perhaps using the final_workflow_outputs_dir workflow option would allow you to delete your entire execution directory when your workflow completes?
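
A minimal sketch of the approach described above: the Python helpers below write a workflow options file that sets `final_workflow_outputs_dir` and, once the workflow has finished and the outputs have been copied, remove the execution directory. The function names, the file name `options.json`, and the assumption that executions live in a local `cromwell-executions` directory are illustrative only; check how your Cromwell version accepts an options file before relying on this.

```python
import json
import shutil
from pathlib import Path


def write_workflow_options(outputs_dir, options_path="options.json"):
    """Write a workflow options file that sets final_workflow_outputs_dir."""
    options = {"final_workflow_outputs_dir": str(outputs_dir)}
    options_file = Path(options_path)
    options_file.write_text(json.dumps(options, indent=2))
    return options_file


def remove_execution_dir(execution_dir="cromwell-executions"):
    """Delete the execution directory; return True if something was removed.

    Assumption: the caller has already confirmed that the final outputs
    were copied, since this deletes everything under execution_dir.
    """
    execution_dir = Path(execution_dir)
    if not execution_dir.is_dir():
        return False
    shutil.rmtree(execution_dir)
    return True
```

Keeping the deletion in a separate function means it can be skipped when a workflow fails and the intermediate files are still needed for debugging.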

  • awacs (Member)

    @mcovarr said:
    Cromwell does not currently have a facility for cleaning up intermediate outputs. But, as @Geraldine_VdAuwera suggested, perhaps using the final_workflow_outputs_dir workflow option would allow you to delete your entire execution directory when your workflow completes?

    Does the final output directory preserve the Cromwell directory structure? And does it save the output of every task, or only the workflow outputs?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie; admin)

    It copies over whatever you include in the workflow outputs block in the WDL. As such the directory structure is not preserved in the outputs folder, but if for example an output is a glob, I think that gets copied over as a directory. The devs may have additional details to offer on Monday.

  • danb (Member, Broadie ✭✭)

    Globbing can indeed preserve the output's directory structure, as seen here.

  • jfiksel (Member)
    edited July 2017

    I am also interested in doing this through Cromwell. However, I would like to extract just the final output file, and not have to deal with the internal Cromwell directory structure (such as the hash number). I used final_workflow_outputs_dir, but this seemed to simply move everything inside cromwell-executions, including all of the directories, to the desired output directory. Is there a way to move just the final file(s) to the desired output directory?

  • Ruchi (Member, Broadie, Moderator, Dev; admin)

    @jfiksel There is currently no way to move just the outputs; however, there is a ticket filed to flatten the directory structure. Would that help your use case?

  • @Ruchi I think that should do the trick. I mostly want to bypass the hash directory, so that I know what the final output path is before running the workflow. I also imagine you could add a task to your workflow that copies the final output to a pre-specified directory outside of the cromwell-executions directory (I believe this solution was suggested in another discussion, though I forget where I saw it).

  • Ok, so I've found having the following task does this for me:

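    task copyFinalOutput {
        File outputFile
        String outputDir
        command {
            # Create the destination if needed, then copy the file there
            mkdir -p ${outputDir}
            cp ${outputFile} ${outputDir}
        }
        output {
            # Note: this still refers to the original file, not the copy in outputDir
            File movedFile="${outputFile}"
        }
    }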
    
    

    In the workflow you can then have something like

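    workflow myWorkflow {
         File inputFile
         String finalOutputDir
         call taskA {
               input:
                  inputFile=inputFile
         }
         # "out" stands in for whatever taskA's output is named
         # ("output" itself is a reserved word in WDL)
         call copyFinalOutput{
               input:
                  outputFile=taskA.out,
                  outputDir=finalOutputDir
         }
    }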
    

    If there are better solutions (maybe the output to the copyFinalOutput task shouldn't be a File), let me know!

  • Ruchi (Member, Broadie, Moderator, Dev; admin)

    Hey @jfiksel, your solution seems reasonable as it works! :+1: However, it seems like there would be issues once docker is involved and that would prevent one from porting this WDL to a cloud backend. I believe some users solve this by running a separate script outside of Cromwell to copy output files to a designated directory by grabbing the final output paths from the workflow metadata. I would think these are all viable workarounds until the feature is implemented into Cromwell itself.
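
The "separate script outside of Cromwell" idea mentioned above could look roughly like the following Python sketch. It assumes the workflow metadata has already been loaded into a dict with a top-level `outputs` entry that maps output names to file paths (or nested lists of paths), and that those paths are on the local filesystem. The function names are hypothetical, and the metadata layout should be checked against what your Cromwell version actually returns.

```python
import shutil
from pathlib import Path


def collect_output_paths(value):
    """Flatten every string found in a nested outputs value into one list."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, (list, tuple)):
        paths = []
        for item in value:
            paths.extend(collect_output_paths(item))
        return paths
    return []  # None, numbers, booleans: not file paths


def copy_final_outputs(metadata, dest_dir):
    """Copy each existing local file listed under metadata['outputs'] to dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in collect_output_paths(metadata.get("outputs", {})):
        source = Path(path)
        if source.is_file():  # skips String outputs that are not files
            copied.append(str(shutil.copy2(source, dest / source.name)))
    return copied
```

Because files are copied by base name into one flat directory, two outputs with the same file name would overwrite each other; a real script would need to decide how to handle that.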

  • dinvlad (Member, Broadie, Dev)

    Hmm, we're using wdl_runner and it does not automatically clean up intermediates. What am I doing wrong?

  • dinvlad (Member, Broadie, Dev)
    edited March 9

    Here's a task for those who want to do cleanup within a single workflow when run on Google cloud:

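    task CleanUpAndExport {
      Array[String] intermediates
      Array[String] outputs
      String outputs_dir
    
      command {
        # -I makes gsutil read the list of URLs from stdin
        # Delete the intermediates first, then move the outputs to outputs_dir
        gsutil rm -I < ${write_lines(intermediates)}
        gsutil mv -I ${outputs_dir} < ${write_lines(outputs)}
      }
      runtime {
        docker: "google/cloud-sdk"
      }
    }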
    

    We then call this task with a list of intermediate outputs from all of the previous calls, a list of outputs from those calls we want to output, and our outputs_dir:

    workflow Hello {
      ...
      call Task1 { ... }
      call Task2 { input: in = Task1.out }
      ...
      call CleanUpAndExport { input:
        intermediates = [ Task1.out, ... ],
        outputs = [ Task2.out, ... ],
        outputs_dir = ...
      }
      output {
        String out = Task2.out
        ...
      }
    }
    
    

    Note that the output section of the workflow here exports those filenames as type String instead of File, since the files will already have been moved by the CleanUpAndExport task (this avoids having two copies of those outputs, one in the outputs_dir and one in the task call folders).
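
Because the CleanUpAndExport task runs `gsutil rm` before `gsutil mv`, a path that appears in both lists would be deleted before it could be moved. If the lists are assembled outside the workflow, a small pre-flight check along these lines may help. This is only a sketch, and the function name is hypothetical.

```python
def find_cleanup_conflicts(intermediates, outputs):
    """Return paths listed both as intermediates (to delete) and outputs (to keep).

    The result preserves the order of `intermediates` and has no duplicates.
    """
    keep = set(outputs)
    conflicts = []
    for path in intermediates:
        if path in keep and path not in conflicts:
            conflicts.append(path)
    return conflicts
```

An empty result means the two lists are disjoint, so no path is scheduled for both deletion and export.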
