Deleting intermediate files

Say I want to run the GATK pipeline, but after the whole pipeline is completed, there are intermediate files from intermediate steps that I don't want anymore. Is there an easy way to find them and delete them?

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    It is possible to identify the final outputs that you care about in your WDL and have them copied to a special location, at which point you can simply delete everything in the execution directories if you'd like. It's not done by default by Cromwell, but the wdl_runner used to control Cromwell in [this setup](https://cloud.google.com/genomics/v1alpha2/gatk) does it. I'll ask the Cromwell team if they have instructions for making that happen through Cromwell itself (assuming that's possible).
  • mcovarr, Cambridge, MA (Member, Broadie, Dev)

    Cromwell does not currently have a facility for cleaning up intermediate outputs. But as @Geraldine_VdAuwera suggested perhaps using the final_workflow_outputs_dir workflow option would allow you to delete your entire execution directory when your workflow completes?
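
As an illustration of the workflow option mentioned above, here is a minimal Python sketch that writes a workflow options file setting `final_workflow_outputs_dir`. The helper name `write_workflow_options` and the example paths are hypothetical and not part of Cromwell; only the option key comes from this thread.

```python
import json
from pathlib import Path


def write_workflow_options(options_path, outputs_dir):
    """Write a JSON workflow options file that sets final_workflow_outputs_dir.

    Hypothetical helper for illustration only; returns the options dict it wrote.
    """
    options = {"final_workflow_outputs_dir": str(outputs_dir)}
    Path(options_path).write_text(json.dumps(options, indent=2) + "\n")
    return options


# Example call (paths are placeholders):
# write_workflow_options("options.json", "/path/to/final_outputs")
```

The resulting file would be supplied to Cromwell as the workflow options when the workflow is submitted. Once the workflow has finished and the copied outputs have been checked, the execution directory could then be deleted by hand.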

  • awacs (Member)

    @mcovarr said:
    Cromwell does not currently have a facility for cleaning up intermediate outputs. But as @Geraldine_VdAuwera suggested perhaps using the final_workflow_outputs_dir workflow option would allow you to delete your entire execution directory when your workflow completes?

    Does the final output directory preserve the Cromwell directory structure? And does it save the outputs of every task, or only the workflow outputs?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    It copies over whatever you include in the workflow outputs block in the WDL. As such the directory structure is not preserved in the outputs folder, but if for example an output is a glob, I think that gets copied over as a directory. The devs may have additional details to offer on Monday.

  • danb (Member, Broadie)

    Globbing can indeed preserve the output's directory structure, as seen here.

  • jfiksel (Member)
    edited July 2017

    I am also interested in doing this through Cromwell. However, I would like to just extract the final output file, and not have to deal with all of the internal cromwell directory structure (such as the hash number). I used the final_workflow_outputs_dir, but this seemed to simply move everything inside of the cromwell-executions, including all of the directories, to the desired output directory. Is there a way to just move the final file(s) to the desired output directory?

  • Ruchi (Member, Broadie, Dev)

    @jfiksel There is currently no way to move just the outputs; however, there is a ticket filed to flatten the directory structure. Would that help your use case?

  • @Ruchi I think that should do the trick. I mostly want to bypass the hash directory, so that I know what the final output path is before running the workflow. I also imagine you could add in a task to your workflow that copies the final output to a pre-specified directory outside of the cromwell-execution directory (I believe this was a solution suggested in another discussion, I forgot where I saw this though).

  • Ok, so I've found having the following task does this for me:

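    task copyFinalOutput {
        File outputFile
        String outputDir
        command {
            # Create the destination if needed, then copy the file into it
            mkdir -p ${outputDir}
            cp ${outputFile} ${outputDir}
        }
        output {
            # Note: this evaluates to the path of the input file, not the copy in outputDir
            File movedFile="${outputFile}"
        }
    }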
    
    

    In the workflow you can then have something like

    workflow myWorkflow {
        File inputFile
        String finalOutputDir
        call taskA {
            input:
                inputFile=inputFile
        }
        call copyFinalOutput {
            input:
                # "output" is a placeholder for the name of taskA's actual output
                outputFile=taskA.output,
                outputDir=finalOutputDir
        }
    }
    

    If there are better solutions (maybe the output to the copyFinalOutput task shouldn't be a File), let me know!

  • Ruchi (Member, Broadie, Dev)

    Hey @jfiksel, your solution seems reasonable as it works! :+1: However, it seems like there would be issues once docker is involved and that would prevent one from porting this WDL to a cloud backend. I believe some users solve this by running a separate script outside of Cromwell to copy output files to a designated directory by grabbing the final output paths from the workflow metadata. I would think these are all viable workarounds until the feature is implemented into Cromwell itself.
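
The "separate script outside of Cromwell" approach described above could look roughly like the following Python sketch. It assumes the workflow metadata has already been saved to a local JSON file, that the file has a top-level `outputs` object mapping output names to paths (or lists of paths), and that those paths are on the local filesystem. The function names are hypothetical.

```python
import json
import shutil
from pathlib import Path


def iter_strings(value):
    """Yield every string found in a possibly nested output value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)


def copy_final_outputs(metadata_path, dest_dir):
    """Copy local files listed under "outputs" in a metadata JSON into dest_dir.

    Assumes a top-level "outputs" object; returns the list of copied paths.
    """
    metadata = json.loads(Path(metadata_path).read_text())
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for value in metadata.get("outputs", {}).values():
        for candidate in iter_strings(value):
            src = Path(candidate)
            if src.is_file():  # skip values that are not existing local files
                target = dest / src.name
                shutil.copy2(src, target)
                copied.append(target)
    return copied
```

Because every file is copied under its base name, the destination is flat and its paths are predictable before the run, but two outputs with the same file name would overwrite each other. Cloud paths are skipped by this sketch, since they are not local files.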