Intermediate Outputs

elcinchu27 Broad Member, Broadie

In a multi-task workflow, I feed the output of one task to the input of the next. In the end, I only want the last of them, but the rest of the intermediate outputs go to the data model. How could I avoid that?

Thank you for your help,

Answers

  • elcinchu27 Broad Member, Broadie

    I think it would save more time if all the intermediate files were automatically removed at the end of the workflow.

  • KateN Cambridge, MA Member, Broadie, Moderator

    Unfortunately that is not possible on the local backend. If you use a cloud backend, such as Google JES, it will automatically remove any undefined outputs.

  • awacs Member

    @KateN said:

    Unfortunately that is not possible on the local backend. If you use a cloud backend, such as Google JES, it will automatically remove any undefined outputs.

    Can you explain how that happens? I am trying to achieve this on AWS and would like some inspiration from how it works in Google JES.

  • dinvlad Member, Broadie, Dev
    edited March 9

    +1, it doesn't seem to remove intermediate files even when run through JES/PAPI. To clarify, we're not talking about the files stored on the VM's disk, but those stored in the individual call- folder on GCS for each task. Moreover, it also keeps a copy of the files specified in the outputs section inside their call- folders.

  • ChrisL Cambridge, MA Member, Broadie, Moderator, Dev

    I think the confusion here is that on a task level, any files not declared in the task outputs will be left on the VM when it gets recycled and therefore lost. On a workflow level, that doesn't happen.

    Historically, the reason is that people have favored call caching over tidy-up, so this has never risen to the top of the new-feature priority list. Having said that, there was some chatter on the Cromwell gitter channel last week regarding an external contribution to provide this, so it may happen soon!

  • dinvlad Member, Broadie, Dev
    edited March 13

    I think it makes sense now. Call caching relies on CRC32 of the files stored on GCS, so it must keep those files around to work properly. It would be nice to make that optional though (i.e. remove files but still have proper caching, e.g. based on file paths)!

  • ChrisL Cambridge, MA Member, Broadie, Moderator, Dev

    @dinvlad we don't need the input files for call caching (we just check our stored hash matches the hash of the new input). We need output files so that we can mimic "run the job and produce the output" by copying the previous result.

    I.e., in the case of a file produced by task A and used by task B, we don't have to keep the intermediate file so that we can call cache B; we have to keep it so that we can call cache A.

  • dinvlad Member, Broadie, Dev
    edited March 13

    Chris - understood. What we call "intermediate files" are the declared outputs of the tasks. So our problem is with storing all of those outputs instead of just the final outputs of the workflow. Ideally, we'd like to keep only the latter while preserving call caching, though I understand if that's not possible in the current approach to call caching (i.e. using checksums of objects on GCS). Thanks

  • dinvlad Member, Broadie, Dev

    Obviously, this wouldn't work in many (if not most) cases. For example, if we change an input so that only the last step in a chain needs to be re-executed, then any previous outputs that step depends on must be kept around. What we have in mind is more like skipping entire branches of execution leading to the final outputs, e.g. a scatter whose output is our final output and some slices of which didn't change from a previous run. In this case the steps for those indices could be skipped entirely even if we had removed all of the intermediate files. I understand how this may be an edge case though.
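
The point made above about tasks A and B can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Cromwell's implementation: the class `CallCache`, its `store` and `cache` dictionaries, the `call-<task>/...` path layout, and the two example tasks are all hypothetical, and SHA-256 stands in for whatever hash the real backend uses. A cache hit is modeled as "the stored hash matches and the previous output file is still there to copy".

```python
import hashlib


class CallCache:
    """Toy model of call caching (hypothetical, not Cromwell's code)."""

    def __init__(self):
        self.store = {}        # simulated bucket: path -> output bytes
        self.cache = {}        # call key -> path of the stored output
        self.executions = []   # names of tasks that actually ran

    def call(self, task_name, func, input_bytes):
        # The key depends only on the task name and a hash of the input,
        # so the input file itself does not need to be kept for a lookup.
        input_hash = hashlib.sha256(input_bytes).hexdigest()
        key = hashlib.sha256(f"{task_name}:{input_hash}".encode()).hexdigest()

        path = self.cache.get(key)
        if path is not None and path in self.store:
            # Cache hit: reuse the previous output instead of running the job.
            return self.store[path]

        # Cache miss, or the stored output was deleted: run the task.
        output = func(input_bytes)
        self.executions.append(task_name)
        path = f"call-{task_name}/{key[:8]}"
        self.store[path] = output
        self.cache[key] = path
        return output


def task_a(data):
    # Hypothetical first task: produces the "intermediate" file.
    return data.upper()


def task_b(data):
    # Hypothetical last task: produces the final workflow output.
    return data[::-1]


def run_workflow(cc, data):
    # Task A's output feeds task B, as in the original question.
    intermediate = cc.call("A", task_a, data)
    return cc.call("B", task_b, intermediate)
```

In this model, running `run_workflow` twice with the same input executes A and B only once. If the stored output of A is then deleted from `store`, a third run re-executes A (there is nothing left to copy), while B is still served from the cache because the hash of its input is unchanged. That matches the explanation that the intermediate file is kept for the sake of caching A, not B.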