Workflows are getting stuck at various tasks

mleventhal (Cambridge, MA; Member, Broadie)

Hello,

I have experienced many instances of tasks getting stuck while running today. In the first screenshot, about 227 out of 387 runs have been stuck in the aborting phase since 11:40 AM today. I aborted these after noticing they were taking especially long to run. When I looked at these workflows individually, I saw that they were stuck in two ways:

1) The task had been launched but then did not run (the only item found in the Google bucket for this submission ID was the script file, with no rc or log files; how I checked is sketched below).

2) The task had actually completed (all outputs were in the bucket and the VM was shut down), but it would not move to the "completed" state.

In either case, the VMs seem to have been spun down: I saw no API activity a few minutes after the point at which I hit "abort".
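
For reference, here is roughly how I was checking the bucket contents for each submission. This is only a sketch: the bucket name and submission ID below are placeholders, and the exact file layout may differ, but a healthy task directory normally contains a script, an rc, and log files.

    # Sketch: list what Cromwell has written under a submission's prefix in the workspace bucket.
    # Assumes the google-cloud-storage package is installed and application-default credentials are set.
    from google.cloud import storage

    BUCKET = "fc-00000000-0000-0000-0000-000000000000"       # placeholder workspace bucket name
    SUBMISSION_ID = "00000000-0000-0000-0000-000000000000"   # placeholder submission ID

    client = storage.Client()
    names = [blob.name for blob in client.list_blobs(BUCKET, prefix=SUBMISSION_ID)]

    # The stuck tasks only ever showed the script file, with no rc or log files.
    print("script files:", [n for n in names if n.endswith("/script")])
    print("rc files:", [n for n in names if n.endswith("/rc")])
    print("log files:", [n for n in names if n.endswith(".log")])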

Additionally, I have a workflow that has been stuck on a call-cached task for 40 minutes; without call caching, that task runs in three minutes (as seen in the second screenshot).

Is there a reason that my tasks are all getting stuck? The workspace is ebert-fc/exac-maf-aggregation; it is already shared with [email protected] Thank you!

Best,
Matt



Answers

  • bshifaw (Member, Broadie, Moderator; admin)

    Hi Matt,
    [email protected] is an older group that's no longer being used; please share the workspace with [email protected]org.
    Thanks

  • mleventhal (Cambridge, MA; Member, Broadie)

    Added! As a follow-up, is this possibly related to the Compute Engine incident mentioned on the Google Cloud console home page?

  • francois_a (Member, Broadie)

    I'm seeing the same issue.

  • bshifaw (Member, Broadie, Moderator; admin)

    I can view the workspace, thanks for sharing. I'm also getting in touch with the dev team now to check whether this is a broader problem.

  • aednichols (Member, Broadie)

    @mleventhal In reference to the GCP status, are your jobs using a SUSE Linux image that requires a license?

  • mleventhal (Cambridge, MA; Member, Broadie)

    I do not think so. I should note that the incident has reportedly been resolved, but I am still experiencing the same issues launching workflows.

  • bigbadbo (Member, Broadie)

    I am observing the same thing in my workspace.

  • mleventhal (Cambridge, MA; Member, Broadie)

    Further updates: I noticed that some of my tasks that were stalled in "aborting" have made it to the "aborted" status. To test whether this meant that I could launch a task, I ran a workflow on two samples: one that had some calls cached, and another with none.

    Both were stalled; the one that had some calls cached made it through two call-cached tasks and is now stuck in "Queued in Cromwell" status.

    The good news is that it no longer seems like a Google issue; the bad news is that there are new issues blocking workflows.

  • aednichols (Member, Broadie)

    @francois_a @bigbadbo @mleventhal thank you for the reports. The dev team is investigating.

  • SaloniShah (Member)

    @mleventhal

    Would you mind sharing the method configs in that workspace as well?

  • aednichols (Member, Broadie)

    Hi folks - thanks for your continued patience.

    From around 11:30 AM to 4:15 PM today our Cromwell instance experienced heavy CPU load and workflow statuses were extremely slow to update, to the point where the system appeared unresponsive.

    Under the covers, workflows were still succeeding or aborting normally, and what you see in FireCloud should be caught up to reality if not now, then very soon.

    We are digging into the cause of the high CPU load in the hope of understanding and preventing occurrences like this.

  • mshand (Member, Broadie, Dev)

    @aednichols I submitted a new workflow this morning and about 4 tasks call-cached successfully very quickly, but now it seems to be stuck again. One task is in a "Running" state but should have been done within 5 minutes, and it's been about 20 minutes so far. The workflows I had submitted on Friday did eventually finish over the weekend, but it looks like something is possibly still a bit slow?

  • KateN (Cambridge, MA; Member, Broadie, Moderator; admin)

    Over the weekend we have seen very high load on the system. This morning, the load increased even more, which is likely leading to the slowness you are experiencing. Our developers are investigating a way to mitigate this high load issue, and we thank you for your patience in the meantime.

  • RLCollins (Harvard Medical School; Member)

    @KateN For what it's worth, I wanted to chime in here: we are encountering similar issues across a variety of workspaces for the Talkowski lab. All workflows we have launched this morning get stuck in the Submitted status for upwards of 30 minutes before flipping to the Running status for any subworkflows/tasks. If useful, an example workflow is submission ID 1bb93759-1da2-450e-bf2d-b05a01663d8c in the workspace talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2. Let us know if there's anything additional I can provide to help diagnose the issue.
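
    In case it helps with reproduction, this is roughly how I have been polling that submission from the API side with the FISS Python client. It is only a sketch: I am assuming the firecloud (FISS) package and its get_submission call, and the status fields shown are simply what I see in my own responses.

        # Sketch: poll a FireCloud submission's status and its per-workflow statuses.
        import time
        from firecloud import api as fapi

        NAMESPACE = "talkowski-sv-gnomad-wgs-v2"
        WORKSPACE = "SV_Talkowski_GNOMAD_WGS-V2"
        SUBMISSION_ID = "1bb93759-1da2-450e-bf2d-b05a01663d8c"

        for _ in range(30):  # check once a minute for ~30 minutes
            resp = fapi.get_submission(NAMESPACE, WORKSPACE, SUBMISSION_ID)
            resp.raise_for_status()
            submission = resp.json()
            workflow_statuses = [wf.get("status") for wf in submission.get("workflows", [])]
            print(submission.get("status"), workflow_statuses)
            time.sleep(60)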

  • KateN (Cambridge, MA; Member, Broadie, Moderator; admin)

    Thank you @RLCollins. I've passed your information on to the developers to see if it will help them. They are actively working on this issue.

  • mleventhal (Cambridge, MA; Member, Broadie)

    I am not experiencing the lag in the "Submitted" status, but I am now receiving "Queued in Cromwell" messages for all of my jobs (same workspace as mentioned before). The method that has been getting stuck (allele-fraction-germline-filter, snapshot 25) is already shared with GROUP_FireCloud-Support.

  • KateN (Cambridge, MA; Member, Broadie, Moderator; admin)

    I've gotten an update from the developers. Our workflow engine is at max capacity, and jobs have started being queued up. There may be latency in new jobs starting; however, once things are running, they should complete as expected.

    Please do not abort your jobs in the meantime; just be patient, as the jobs will launch once the queue clears up. I also wanted to mention that you don't pay anything while your workflow is sitting in the "queued" state; there won't be any additional cost.

    We are constantly working to improve our systems to be able to increase our max capacity. Thank you for your patience in the meantime.

  • cbao (Member, Broadie)

    Thanks @KateN. Your message is very helpful!

  • mleventhal (Cambridge, MA; Member, Broadie)
    edited August 2018

    Adding to this thread again to state that I am seeing "Queued in Cromwell" messages again this morning, probably due to the same cause.

  • mshand (Member, Broadie, Dev)

    I'm seeing some slowness this morning as well. I'm not seeing "Queued in Cromwell" in particular, but workflows are slow to start and tasks have now been "Running" for longer than expected (and don't have anything in the Google bucket, not even a script file).

  • mleventhal (Cambridge, MA; Member, Broadie)

    Mine have finally started running, but they are no longer hitting the call cache at all.
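
    In case it is useful for diagnosis, this is roughly how I have been checking whether individual calls register a cache hit, by pulling the workflow metadata through the FISS client. It is only a sketch: the submission and workflow IDs are placeholders, and the callCaching fields are simply what the metadata in my workflows looks like.

        # Sketch: inspect Cromwell call-caching results for one workflow in a submission.
        from firecloud import api as fapi

        NAMESPACE = "ebert-fc"
        WORKSPACE = "exac-maf-aggregation"
        SUBMISSION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
        WORKFLOW_ID = "00000000-0000-0000-0000-000000000000"    # placeholder

        resp = fapi.get_workflow_metadata(NAMESPACE, WORKSPACE, SUBMISSION_ID, WORKFLOW_ID)
        resp.raise_for_status()
        metadata = resp.json()

        # Each call attempt carries a callCaching block whose "hit" flag says whether the cache was used.
        for call_name, attempts in metadata.get("calls", {}).items():
            for attempt in attempts:
                caching = attempt.get("callCaching", {})
                print(call_name, attempt.get("executionStatus"), "cache hit:", caching.get("hit"))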

  • RLCollins (Harvard Medical School; Member)
    edited August 2018

    Same as @mshand on our end, too. Of the three workflows I've launched this morning, all are queued successfully and are marked as Running on the Monitor tab, but no subworkflow details are populated (after more than 20 minutes of Running), and the submission bucket is never populated with any files.

    Example submission:
    Submission ID: 8f9f7581-4c18-49f2-a076-a3f82a66d36b
    Workspace: talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2

    Thanks!
    Ryan

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator; admin)

    Hi everyone - we are aware of the slowness. We are looking into the other concerns as well. I will post back with more info when I have it.

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator; admin)

    @mleventhal - the Cromwell team has been investigating why Cromwell is having call-caching issues since early this morning.

  • RLCollins (Harvard Medical School; Member)

    Hi @Ruchi, thanks for the detailed update and the explanations.

    For what it's worth, the problem with tasks sometimes waiting more than an hour before they can check the cache and/or be scheduled and started has persisted or, if anything, gotten worse.

    It is becoming a nontrivial drag on our progress on multiple projects, so we would definitely appreciate being kept in the loop on any changes/updates/upgrades that are expected to ameliorate these issues.

    Thanks!
    Ryan

  • mleventhal (Cambridge, MA; Member, Broadie)

    I am experiencing the same issues @RLCollins described.

  • KateN (Cambridge, MA; Member, Broadie, Moderator; admin)

    @RLCollins We certainly hear you. We are refocusing our efforts to better keep you, and all our users, in the loop on the issues we've been experiencing and our plans for handling those issues. If you have specific questions, certainly feel free to ask in the forum. If you're looking for just news and information, keep your eye on our Blog as well as banners on the forum (the top of your page here) and on the FireCloud portal itself. We tend to reserve portal banners for emergency notices, as the banners there only come in red. For more general news and updates, the blog and forum banners are the place to go.

  • lelagina (Member, Broadie)

    Hello,

    I have been experiencing the same issues since last night across various workflows.

    Thank you,
    Luda.

  • agraubert (Member, Broadie)

    +1 I am also experiencing this issue

  • KateN (Cambridge, MA; Member, Broadie, Moderator; admin)

    @mleventhal @lelagina @agraubert
    To make sure that the issue you're experiencing is the known issue we are working on fixing, could you please share your workspace(s) with [email protected]? If it is a different issue, we may be able to mitigate it in a different way.

  • mleventhal (Cambridge, MA; Member, Broadie)

    The workspace is ebert-fc/exac-maf-aggregation, already shared with [email protected]

  • KateN (Cambridge, MA; Member, Broadie, Moderator; admin)

    Thank you @mleventhal, we are looking into this.

  • KateN (Cambridge, MA; Member, Broadie, Moderator; admin)

    @mleventhal Our developers looked into your workspace and saw that the currently running workflow had moved past the pending status. It had a large number of inputs, which resulted in the job taking a while to update its status. It was still proceeding as it should; it was just slow to update FireCloud on its progress.

    This is the same issue we are aware of, and we are actively working to fix it.

  • lelagina (Member, Broadie)

    Hello Firecloud Team,

    My workflows are getting stuck at "Starting". The workflow itself has only 8 input parameters.

    Thank you,
    Luda.

  • lelagina (Member, Broadie)

    30 minutes later:

  • lelagina (Member, Broadie)

    This workflow started running.

  • bshifaw (Member, Broadie, Moderator; admin)

    Thanks for sharing, @lelagina.
    If you're having the same problem as the other users, then you may benefit from a fix the developers are currently working on.
    If you come across the error again and believe it's different from the others, you can share your workspace with [email protected] and tell us the name of the workspace so we can diagnose the problem.

  • lelagina (Member, Broadie)

    Hello @bshifaw ,

    Thank you for letting me know about the FireCloud fix. Would you happen to know when this fix is coming out? I am experiencing the same waiting as yesterday:

    Thank you,
    Luda.

  • KateN (Cambridge, MA; Member, Broadie, Moderator; admin)

    @lelagina You can see Ruchi's reply here for estimated timelines of some fixes she and her team are currently working on. I'm reaching out to Ruchi again to see if we have any updates to those numbers, but you should use her earlier post as a guide for now.

  • KateN (Cambridge, MA; Member, Broadie, Moderator; admin)

    I've spoken with Ruchi, and the estimates she gave earlier are still on track. This work is currently their top priority, and we are doing everything we can to get this fixed as soon as possible. We thank you for your patience in the meantime, and I will be sure to post an update when we have a fix or if the timeline changes. If you encounter an issue you believe to be caused by something different from the issues described in this thread, please report it and we will be happy to look into it.
