Multiple issues with FC workflow execution & call caching
Hi FC team,
Our team has been unable to get any workflows to successfully complete since mid-afternoon yesterday, Tues, Oct 2. We have experienced these issues across multiple workspaces, workflows, and configs.
There are no persistent errors, but here is a sampling of the issues we have encountered:
Workflow died because of a temporary server error (example: https://portal.firecloud.org/#workspaces/talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2/monitor/43065914-ef9b-4634-81b2-8675b1176ca5/b02e2598-e568-447c-8eac-43ba6bf40185)
Tasks in workflows with call caching disabled are suddenly spending 1-2 hours in a CheckingCacheEntryExistence state (example: task CleanVCF.Clean4.combine_multi_IDs in Call #2 here: https://portal.firecloud.org/#workspaces/talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2/monitor/4678e997-47d2-4fbe-8c4f-4c65120e341b/d9c779da-d6eb-4a31-b5a5-0321b6b46823)
Workflows with call caching enabled not launching for over 12 hours (example: https://portal.firecloud.org/#workspaces/talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2/monitor/555bda8d-6cdd-45ec-b91d-7b89d7ac83b6/1357a70f-afad-4a77-8dcb-9f6a7804ac29)
Tasks failing because they aren't able to find outputs from previous tasks, despite these outputs existing in the gs:// bucket and looking correct when downloaded & investigated locally (example: task CleanVCF.cleanvcf5 in Call #2 here: https://portal.firecloud.org/#workspaces/talkowski-sv-gnomad-wgs-v2/SV_Talkowski_GNOMAD_WGS-V2/monitor/4678e997-47d2-4fbe-8c4f-4c65120e341b/d9c779da-d6eb-4a31-b5a5-0321b6b46823)
I suspect these issues could be related to the following two posts from yesterday by @jgould and @Chip:
At this point, we are completely stalled on all workspaces, and don't want to launch any new workflows due to these unpredictable errors and long queue times.
Any idea what could be going on, or how long we should expect this behavior to persist?
Thanks a lot,
Ryan & the Talkowski lab