Cromwell workflows appear hung
I've been running workflows on Google and have begun to observe workflows that have no jobs running (as assessed by filtering VM labels for the Cromwell workflow ID in question) but are still listed as "Running" when I query Cromwell for their status. The oldest of these has been running for nearly 6 days, with no running jobs for the last two. At this point I believe these workflows are "stuck" in a Running status. I did restart the Cromwell process once to bring it up with more memory allowed for the Java process, but only the oldest workflow spans this restart.
I've been regularly submitting new workflows to try to keep the number of running jobs at the maximum, which is currently limited by my quota for assigned IP addresses. Only a single WDL workflow is in use (https://github.com/hall-lab/sv-pipeline/blob/post_merge_alterations/scripts/Post_Merge_Gt_Cn.wdl).
I have Cromwell (version: "26-22fe860-SNAP") running on a 4-core, 26GB VM (n1-highmem-4) running Ubuntu 14.04.5 LTS. I've tried to keep an eye on RAM utilization and saw it peak at around 13GB for the Cromwell process. However, I've never seen less than 10GB of RAM free on the entire machine. I have a separate VM (n1-standard-2) for MySQL running Ubuntu 16.04.2 LTS. It appears I'm occasionally saturating the database VM's CPU (likely by asking for metadata for large workflows). Both VMs have 40GB persistent disks attached with plenty of free space.
I've searched the GitHub issues and the forum for similar reports and have been unable to find any. At this point, I have two questions:
- Is there any additional information I could provide to help diagnose what's going on?
- Are there any tricks I could try that might rescue the stuck workflows? Call caching is disabled for most of these because of naming issues with the input data bucket (it contains underscores), so I don't believe I can simply relaunch without incurring significant expense.
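In case it helps with the first question, here is a minimal sketch of how the calls that Cromwell still considers active could be listed from a workflow's metadata and then compared against the VMs that actually exist. It assumes the metadata endpoint lives at `/api/workflows/v1/<id>/metadata` and that each call attempt carries `executionStatus` and `shardIndex` fields. The base URL, the set of terminal statuses, and the function names are placeholders of mine, not anything confirmed for this Cromwell version.

```python
import json
import urllib.request

# Assumed set of statuses that mean a call attempt has finished.
TERMINAL_STATUSES = {"Done", "Failed", "Aborted", "Bypassed"}


def fetch_metadata(base_url, workflow_id):
    """Fetch workflow metadata as a dict (endpoint path is an assumption)."""
    url = f"{base_url}/api/workflows/v1/{workflow_id}/metadata"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def active_calls(metadata):
    """Return (call name, shard index, status) for every non-terminal attempt."""
    active = []
    for call_name, attempts in metadata.get("calls", {}).items():
        for attempt in attempts:
            status = attempt.get("executionStatus", "Unknown")
            if status not in TERMINAL_STATUSES:
                active.append((call_name, attempt.get("shardIndex", -1), status))
    return sorted(active)


if __name__ == "__main__":
    # Hypothetical server address and workflow ID; replace with real values.
    meta = fetch_metadata("http://localhost:8000", "<workflow-id>")
    for name, shard, status in active_calls(meta):
        print(name, shard, status)
```

If a workflow reports calls here that have no matching VM, that list of call names and shard indices is probably the most useful thing to attach. Note that requesting full metadata for large workflows is the same operation I suspect of saturating the database VM, so this is best run sparingly.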