Cromwell database outage

I can't find documentation on how a Cromwell server copes if it loses its database connection mid-transaction. What would happen if (a) there is a brief loss of connectivity, or (b) the database goes down completely?


  • ChrisL (Cambridge, MA; Member, Broadie, Dev admin)
    edited July 2018

    When it loses database connectivity, Cromwell can no longer do anything that requires a database connection, including (but not limited to):

    • Accepting new workflows
    • Storing and retrieving workflow metadata
    • Starting new jobs in existing workflows

    If it loses its database, you might as well just shut down Cromwell because it won't reconnect (even if the database comes back online) without a service restart.

    Luckily, Cromwell will not do anything destructive either. For example, if it has already submitted jobs to a cloud backend for execution, they won't be lost. When the database comes back and Cromwell is restarted, it will resume its workflows from where they left off and reconnect to any existing jobs still running on cloud backends.

    EDIT: If Cromwell is brought down at the same time that the database connection is lost, it can run through a well-tested "reconnection" process when it is restarted against a reconnected database, which should be able to recover any running jobs at the point where the disconnection occurred.

  • ChrisL (Cambridge, MA; Member, Broadie, Dev admin)

    EDIT: You can probably disregard the lower half of my earlier response. I've just tried this again and the result was a little more mixed:

    • The first time I brought down the database for 20 seconds, my running workflow was able to continue uninterrupted when the database came back.
    • The second time, my workflow failed immediately when the database came back.

    So I think the more correct answer here is:

    • If the database connection is interrupted, running workflows might be OK or they might fail, depending on exactly which point each one had reached when the connection dropped.
    • Bringing Cromwell down at the same time as the database will allow it to run its restart process on all half-finished workflows when the database returns, which is a more comprehensive and well-tested process than trying to recover when a previously-unavailable database connection suddenly comes back.
    • New workflows (i.e., ones that hadn't even been submitted when the database was disconnected) ran through fine on the reconnected database. It was only in-progress workflows that sometimes struggled.
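
For anyone wanting to automate the "bring Cromwell down with the database, restart it when the database returns" approach described above, here is a minimal Python sketch of the decision logic. It is not part of Cromwell. The names `database_reachable`, `next_action` and `Action` are hypothetical, and a plain TCP probe is only an assumed stand-in for a real database health check. How you actually stop and start the Cromwell service (systemd, Docker, etc.) is left to your own setup.

```python
import socket
from enum import Enum


class Action(Enum):
    """What a (hypothetical) watchdog should do on this polling cycle."""
    NONE = "none"
    STOP_CROMWELL = "stop_cromwell"
    START_CROMWELL = "start_cromwell"


def database_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the database port can be opened.

    Assumption: an open port is treated as "database is up". A real check
    would probably run a trivial query instead.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def next_action(db_up: bool, cromwell_running: bool) -> Action:
    """Decide what to do given the current database and Cromwell state."""
    if not db_up and cromwell_running:
        # Database is gone: stop Cromwell so in-progress workflows are
        # recovered by the restart process rather than a live reconnect.
        return Action.STOP_CROMWELL
    if db_up and not cromwell_running:
        # Database is back: start Cromwell so it can resume its workflows.
        return Action.START_CROMWELL
    return Action.NONE
```

A watchdog loop would call `database_reachable(...)` every few seconds, pass the result to `next_action(...)` along with whether the Cromwell process is currently running, and then run whatever stop or start command your deployment uses. Keeping the decision in a pure function makes it easy to test without a real database.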