Cromwell workflows appear hung

ernfrid · Saint Louis · Member

I've been running workflows on Google and have begun to see workflows that have no jobs running (as assessed by filtering VM labels for the Cromwell workflow ID in question) but are still listed as "Running" when I query Cromwell for their status. The oldest of these has been running for nearly 6 days, with no running jobs for the last two. At this point I believe these workflows are "stuck" in a Running status. I did restart the Cromwell process once to bring it up with more memory allowed for the Java process, but only the oldest workflow spans this restart.
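For reference, the comparison described above can be sketched in Python. This is only a minimal sketch under stated assumptions: the base URL `http://localhost:8000` is a placeholder for wherever the Cromwell server listens, the helper names are hypothetical, and the set of workflow IDs that still have VMs is assumed to be collected separately (for example from the VM label filter mentioned above). The query uses Cromwell's `/api/workflows/v1/query` REST endpoint.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder address for the Cromwell server.
CROMWELL_URL = "http://localhost:8000"


def running_workflow_ids(base_url=CROMWELL_URL):
    """Return the IDs of workflows that Cromwell reports as Running."""
    query = urlencode({"status": "Running"})
    with urlopen(f"{base_url}/api/workflows/v1/query?{query}") as resp:
        payload = json.load(resp)
    return {result["id"] for result in payload.get("results", [])}


def suspected_stuck(running_ids, ids_with_active_vms):
    """Workflows reported as Running that have no VM associated with them.

    ids_with_active_vms is assumed to be gathered elsewhere, e.g. by
    filtering VM labels for each workflow ID.
    """
    return sorted(set(running_ids) - set(ids_with_active_vms))
```

A workflow that legitimately sits between jobs would also show up in this list for a short time, so the result is a list of candidates to watch rather than proof that a workflow is stuck.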

I've been regularly submitting new workflows to try to keep the number of running jobs at the maximum which is currently limited by my quota for assigned IP addresses. There is only a single WDL workflow being utilized (https://github.com/hall-lab/sv-pipeline/blob/post_merge_alterations/scripts/Post_Merge_Gt_Cn.wdl).

I have Cromwell (version: "26-22fe860-SNAP") running on a 4-core, 26GB VM (n1-highmem-4) running Ubuntu 14.04.5 LTS. I've tried to keep an eye on RAM utilization and saw the Cromwell process peak at around 13GB. I've never seen less than 10GB of RAM free on the entire machine, however. I have a separate VM (n1-standard-2) for MySQL running Ubuntu 16.04.2 LTS. It appears I'm occasionally saturating the database VM's CPU (likely by asking for metadata for large workflows). Both VMs have 40GB persistent disks attached and plenty of free space.

I've searched the GitHub issues and the forum for similar reports and have been unable to find any. At this point, I'm curious if:

  1. There is any additional information I could provide to help diagnose what's going on?
  2. There are any tricks I could try that might rescue the stuck workflows? Call caching is disabled for most of these because of naming issues with the input data bucket (it contains underscores), so I don't believe I can simply relaunch without incurring significant expense.

Thank you!


Answers

  • ernfrid · Saint Louis · Member

    I haven't tried to abort the workflows, but I'm using ContinueWhilePossible mode, so that may be the issue.
    I'm in the middle of a rather sizeable project and had been putting off upgrading, but it sounds like that may have been the wrong move.

    Do you happen to know whether these are likely to resume if they're still "Running" when I upgrade to Cromwell 29? I tried stopping and restarting the Cromwell process yesterday evening in the hope that they would recover their proper state, but it had no effect.

  • mcovarr · Cambridge, MA · Member, Broadie, Dev

    To be honest I'm not 100% sure, but I doubt that restarting with Cromwell 29 in place will unstick these jobs. At least the upgrade should prevent the problem from happening again.

  • mcovarr · Cambridge, MA · Member, Broadie, Dev

    Also it's possible I'm wrong and these jobs will unstick, possibly depending on the way they got stuck. I'm talking it over with a fellow Cromwell developer and she had a different opinion than mine. :smile:

  • ernfrid · Saint Louis · Member

    Seems worth a try at least. I don't think I'll be any worse off, but I'd probably wait until I don't have any more running jobs.

  • ernfrid · Saint Louis · Member

    @mcovarr - do you know if I need to apply database migrations serially (e.g. Cromwell 27 -> 28 -> 29)? Or can I go straight to Cromwell 29?

  • mcovarr · Cambridge, MA · Member, Broadie, Dev

    When you start up Cromwell it will apply all the required migrations in the correct order.

  • mcovarr · Cambridge, MA · Member, Broadie, Dev

    Hmm that may not have answered the question you actually asked -- you can go straight to Cromwell 29. :smile:

  • ernfrid · Saint Louis · Member

    Thanks @mcovarr. I am going to wait until I have no more running jobs and then I'll upgrade to Cromwell 29 and cross my fingers.

  • ernfrid · Saint Louis · Member

    Upgrading to Cromwell 29 did, in fact, unstick all of my stuck workflows. They have now all failed, which I assume is the correct final state.
    Thanks @mcovarr!

  • mcovarr · Cambridge, MA · Member, Broadie, Dev

    Excellent, glad to hear that! :smile:
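For anyone repeating this upgrade, a quick way to confirm that previously stuck workflows have reached a final state is to ask Cromwell for each workflow's status. The following is a minimal sketch, not something taken from the thread: the base URL is a placeholder, the helper names are hypothetical, and the set of terminal statuses is an assumption based on the statuses Cromwell normally reports (Succeeded, Failed, Aborted). It uses the `/api/workflows/v1/{id}/status` REST endpoint.

```python
import json
from urllib.request import urlopen

# Assumption: statuses treated as final for a Cromwell workflow.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Aborted"}


def workflow_status(workflow_id, base_url="http://localhost:8000"):
    """Fetch the current status string for one workflow from Cromwell."""
    with urlopen(f"{base_url}/api/workflows/v1/{workflow_id}/status") as resp:
        return json.load(resp)["status"]


def not_yet_final(statuses):
    """Given {workflow_id: status}, list workflows not in a terminal state."""
    return sorted(
        wf_id for wf_id, status in statuses.items()
        if status not in TERMINAL_STATUSES
    )
```

Calling `not_yet_final({wf: workflow_status(wf) for wf in stuck_ids})`, where `stuck_ids` is the list of workflow IDs that appeared hung, should return an empty list once every workflow has settled.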
