Specify timeout for WDL tasks on JES

dinvlad Member, Broadie, Dev

Hi Team,

Is it possible to set a timeout for WDL tasks that run on Google Cloud? Some backends (e.g. BCS) seem to support a timeout runtime attribute, as indicated in cromwell.examples.conf. Is there such a setting for the JES backend?
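
For reference, something like the sketch below is what I'd like to be able to express. The timeout attribute here is the one mentioned for BCS in cromwell.examples.conf; I'm assuming seconds for the units, and as far as I can tell the JES/PAPI backend simply ignores it today:

    task long_running {
      command {
        sleep 600
      }
      runtime {
        docker: "ubuntu:18.04"
        # Hypothetical: mark the call failed after ~1 hour.
        # Currently only the BCS backend documents a timeout attribute.
        timeout: 3600
      }
    }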

Thanks

Answers

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hey @dinvlad,

    It's possible to apply such a timeout on the Pipelines API v2 backend. For the use case you have in mind, would it be sufficient to have a workflow option with such a timeout?

  • tlr Member
    edited May 5
    Hi @dinvlad and @ruchi,

    What is the status of this? I am encountering a problem on Google Cloud (Pipelines API v2) where some of my tasks hang (very rarely, but I am scattering many tasks, so it happens often). If I force-quit the run and restart it, it works just fine.

    It would be great to be able to specify a time after which the task is considered failed, so that Cromwell either exits gracefully or retries the task.

    Thanks for making such a useful tool!
  • dinvlad Member, Broadie, Dev
    edited May 6

    Hi tlr,

    We've observed the same issue, and AFAIK the Cromwell team is looking into it. Thanks for reporting!

    If you could provide operation IDs, that’d be hugely helpful though!

    Best

  • tlr Member
    Hi dinvlad,

    What do you mean by operation IDs?

    Also, it seems unpredictable. For instance, I ran the same workflow (that should take <12 hours) three different times without changing anything. Two times, it stalled out on a task (different each time) and hung there for days until I killed it. One time, no stalling occurred and everything ran through just fine.

    It seems likely that this is a Google Pipelines issue rather than a Cromwell issue. I guess the question is whether Cromwell can be enhanced to identify and restart such cases (for instance, if one knows a task should take at most an hour or so).

    Thanks!
  • dinvlad Member, Broadie, Dev

    Hi tlr,

    In Pipelines API v2, there's a jobId parameter for each Cromwell task call. It has the form projects/your-project/operations/1234567890. This is what we need to pass to the Cromwell developers so they can troubleshoot it. You can get the value of jobId for each task call from the Cromwell API.
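
    If you're running Cromwell in server mode, something along these lines should list them (a sketch only; adjust the host, port, and the workflow ID placeholder for your setup, and it assumes you have jq installed):

        # Fetch the workflow's metadata from Cromwell's REST API and print each call's jobId
        curl -s "http://localhost:8000/api/workflows/v1/<workflow-id>/metadata" \
          | jq -r '.calls[][] | .jobId // empty'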

    Btw, we're also experiencing these as random issues. I also think the problem is on Google's side rather than Cromwell's.

    Thanks

  • ChrisL Cambridge, MA Member, Broadie, Moderator, Dev admin
    edited May 6

    I've made https://github.com/broadinstitute/cromwell/issues/4946 to track this request.

    Note for @Ruchi - I think it could be a relatively minimal change to wire through the existing timeout runtime attribute (currently BCS only) to PAPIv2. The code line to change is https://github.com/broadinstitute/cromwell/blob/develop/supportedBackends/google/pipelines/v2alpha1/src/main/scala/cromwell/backend/google/pipelines/v2alpha1/GenomicsFactory.scala#L135

  • tlr Member
    Thanks dinvlad and ChrisL,

    I saved the output of each workflow run using "--metadata-output", but I am not sure whether the jobIds of the hung tasks made it into this file (since I had to kill Cromwell). Would a jobId that never finished make it into metadata.json?

    If not, I can re-run it a few times and save the stdout from Cromwell.
  • dinvlad Member, Broadie, Dev

    Yep, you can see jobId for each task call inside metadata.json.
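
    A quick way to list them from a saved file (assuming jq is available; the calls structure below is just the standard layout of Cromwell metadata):

        # Print "<call name>: <jobId>" for every task call recorded in metadata.json
        jq -r '.calls | to_entries[] | .key as $call | .value[] | "\($call): \(.jobId // "no jobId recorded")"' metadata.json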
