
Request Timed Out (submission on sample_set entity with 8891 samples)

esalinas (Broad; Member, Broadie)
edited September 2017 in Ask the FireCloud Team

I have a WDL and a method configuration for it. The root entity type of the configuration is a sample_set.

I have two submissions from this configuration on a sample_set containing 8891 samples. The WDL first performs a "scatter" whose width equals the number of items in the sample set; then a single "gather" step takes in the array of scatter outputs and combines them.
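For reference, here is a minimal sketch of that scatter/gather shape. It is not the actual PoN-creation WDL; the task names, inputs, and commands are placeholders for illustration only.

task PerSampleStep {
  File bam
  command {
    # placeholder per-sample processing
    echo "processed ${bam}" > shard.txt
  }
  output {
    File out = "shard.txt"
  }
  runtime {
    docker: "ubuntu:16.04"
  }
}

task CombineShards {
  Array[File] shards
  command {
    # placeholder gather step over all per-sample outputs
    cat ${sep=" " shards} > combined.txt
  }
  output {
    File combined = "combined.txt"
  }
  runtime {
    docker: "ubuntu:16.04"
  }
}

workflow PoNCreation {
  # One entry per sample in the sample_set (8891 entries in this case).
  # In FireCloud this would typically be bound through the method configuration,
  # e.g. an expression like this.samples.bam (attribute name hypothetical).
  Array[File] sample_bams

  # "Scatter": one shard per sample, so the scatter width equals the sample_set size
  scatter (bam in sample_bams) {
    call PerSampleStep { input: bam = bam }
  }

  # "Gather": a single call that consumes the array of per-shard outputs
  call CombineShards { input: shards = PerSampleStep.out }
}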

The two submissions are linked below; you can see the workspace name and submission ID in each URL.

https://portal.firecloud.org/#workspaces/broad-firecloud-testing/hg38_PoN_Creation_copy/monitor/bbf0ecec-a742-4749-804c-9e390035b3ba/98ac216e-7e26-4cb2-b74e-84df71e63e79

https://portal.firecloud.org/#workspaces/broad-firecloud-testing/hg38_PoN_Creation_copy/monitor/fc3108b0-abcf-41b9-b273-5652f02439b7/903efa1a-1ebf-4246-81b0-4f3339f3ead5

For both submissions, when I go to the submission in the UI (from the Monitor tab), I get this message:

Server Unavailable
FireCloud service is temporarily unavailable. If this problem persists, check http://status.firecloud.org/ for more information.

While trying to do so, I have the browser console open on the Network tab, and there I see:

{
  "statusCode": 504,
  "source": "FireCloud",
  "timestamp": 1506518502605,
  "causes": [],
  "stackTrace": [],
  "message": "Request Timed Out"
}

For other submissions I encounter no error message and can open them in the UI with no problem.

It seems there is a timeout, and based on previous experience I speculate that the timeout is related to the size of the sample_set (8891), because for the same WDL I can view submissions on sample_sets of smaller size (1) without encountering any problems.

Answers

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie, Moderator)

    Hi Eddie!

    Thanks for this awesome report. Can you share the workspace with [email protected] and we can confirm?

    Many thanks,
    Tiff

  • esalinas (Broad; Member, Broadie)

    Hi @Tiffany_at_Broad, I just deleted that workspace (on purpose) earlier today...

    -eddie

  • KateN (Cambridge, MA; Member, Broadie, Moderator)

    I'm so sorry there was a delay in replying to this thread; I am still catching up. Without a workspace to see the example in, it will be harder to replicate the results you saw; however, I still think it is important to know why a workflow timed out on you like this. If we don't currently support such a wide scatter, then perhaps there are improvements to be made. I will have a developer take a look soon.

  • esalinas (Broad; Member, Broadie)
    edited October 2017

    Hi @KateN, I think the system would support a scatter of that width. I launched the job numerous times and some of the scatters (of that width) succeeded... BUT it seems that on occasions where there was some error (in all cases seemingly transient), those submissions would have timeouts. Of the various submissions I made, maybe 10-15 of them, most failed with some transient error. I linked to three of them in another thread. With such a wide scatter, and with call-caching copying results, a lot of bucket storage was used.

    The bucket storage costs were very high, and that motivated deleting the workspace and bucket.

  • KateN (Cambridge, MA; Member, Broadie, Moderator)

    Could you link the three you mentioned in another thread, or link that thread here so I can investigate?

  • KateN (Cambridge, MA; Member, Broadie, Moderator)

    After discussing with a developer, we think there is a misunderstanding about the "Server Unavailable" and timeout errors. These refer to interacting with FireCloud (for example, trying to get the status of the submission); they are not related to the actual job submission. Cromwell could still be processing the job, and it sounds like it did, given the amount of data that resulted from the analysis and the subsequent bucket storage costs. Were there actual job failures when you were eventually able to get the job status, if you can recall? Unfortunately, with the workspace deleted we can't investigate much further, only speculate as to the cause.
