connect timed out

esalinas Broad Member, Broadie ✭✭✭

Today I submitted a single data entity (a pair) and got a "connect timed out"
message. I immediately resubmitted, and the error did not occur the second time.
Since the resubmission went through without the timeout, the issue appears to be
transient.
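
The resubmit-on-timeout workaround described above could be automated along these lines. This is only a minimal sketch: `submit_with_retry` and its `submit` callable are hypothetical names, not part of FireCloud or any client library, and it assumes your client surfaces "connect timed out" as an exception type you pass in via `retry_on`.

```python
import time


def submit_with_retry(submit, max_attempts=3, base_delay=1.0,
                      retry_on=(TimeoutError,), sleep=time.sleep):
    """Call submit() and retry with exponential backoff on timeout errors.

    `submit` is a hypothetical zero-argument callable that performs one
    submission. `retry_on` is an assumption: pass whatever exception
    type(s) your client raises for "connect timed out".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the timeout.
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

One caveat: if a timeout happens after the server has already accepted the request, a blind retry may launch the same submission twice, so it is worth checking the workspace before resubmitting large batches.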

Issue · Github
by Geraldine_VdAuwera

Issue Number: 1611
State: closed
Closed By: vdauwera

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Thanks for reporting this; we're now keeping track of timeouts and will add this one to the tally. If you get other "connection timed out" failures, please post a comment in this same thread.

  • birger Member, Broadie, CGA-mod ✭✭✭

    I just encountered another instance of this "connect timed out" message. This time I was launching a workflow on a set of two pairs. One workflow launched successfully; the second reported "connect timed out".

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Noted, thanks.

  • birger Member, Broadie, CGA-mod ✭✭✭

    Just encountered 9 additional instances of connect timeouts.

  • birger Member, Broadie, CGA-mod ✭✭✭

    And, in a separate workspace, two instances of "read timeout" and 8 instances of "connect timed out".

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Yikes, that's a lot of timeouts. Can you tell me if you get as many timeouts in the next few days? We'll want to know if this morning's reboot reduces their occurrence or not.

  • birger Member, Broadie, CGA-mod ✭✭✭

    Yesterday we submitted an analysis that launched 1000 workflows. Currently, we see 985 successful completions, 12 failures and 3 still running. The workflows in the running state are probably "stuck", but I haven't looked at them closely yet. Of the 12 failures, six are "connect timed out", one is "POST request to /api/workflows/v1/batch timed out", and five are other failure conditions that I have not looked into yet. I've attached a screenshot of the failed workflows.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Noted, thank you. We'll see if the team can do anything about these timeouts.

  • birger Member, Broadie, CGA-mod ✭✭✭

    Another case of timeouts. We submitted another analysis that launched 1000 workflows. The report shows 986 succeeded, 10 failed and 4 still running. Of the 10 failures, 9 were reported as timeouts. The other failure (one where the workflow was assigned a workflow id) is a case where the workflow never got started: no reported calls, and no directory on the bucket corresponding to the workflow.

    Of the four workflows still reported as running, three each have a mutect1 scatter job stuck in the "Running" state, but if you look at the job status using "gcloud alpha genomics operations describe", it reports the job as done. So Cromwell is not correctly tracking the state of jobs. The fourth workflow reported as running is a case of the workflow never getting started: no reported calls.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Thanks for reporting this. The Cromwell team is working on some fixes that should improve reliability and reduce the occurrence of such issues.
  • birger Member, Broadie, CGA-mod ✭✭✭

    Thank you. Any idea when we can expect to see those improvements in FireCloud?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Frankly, it's going to be a while -- I would be surprised if it happened before the end of the quarter. Once the Cromwell team is done making improvements (which will likely be a few more weeks), the Workbench team then needs to get that new version of Cromwell into FireCloud, then the whole thing needs to go through QA for testing before public release. Typically that process (post-Cromwell release) takes about a month.

  • birger Member, Broadie, CGA-mod ✭✭✭

    OK...I'll let the team know we will need to live with these timeouts through the end of this quarter, but we can expect them to go away at the beginning of April.

    Are you still tracking occurrences of timeouts? Do you want us to continue reporting them?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Thanks, I think we've got enough information at this point and don't need any further reports, unless you see something that produces new/different symptoms.
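
For anyone hitting the "stuck in Running" symptom reported earlier in this thread, the manual cross-check (comparing what the workflow engine shows against what the backend reports for the same job) might be scripted roughly as follows. This is a sketch under assumptions: `find_stale_running` is a hypothetical helper, and both input dictionaries are data you would have to collect yourself, for example by recording whether "gcloud alpha genomics operations describe" reports each job as done.

```python
def find_stale_running(engine_status, backend_done):
    """Return ids of jobs shown as Running that the backend reports as done.

    engine_status: dict mapping a job id to the status string shown by the
        workflow engine (hypothetical input, gathered by the caller).
    backend_done: dict mapping a job id to True/False, i.e. whether the
        backend reports that job as done (hypothetical input).
    """
    return sorted(
        job_id
        for job_id, status in engine_status.items()
        if status == "Running" and backend_done.get(job_id, False)
    )


# Example with made-up job ids:
engine = {"op-1": "Running", "op-2": "Running", "op-3": "Done"}
backend = {"op-1": True, "op-2": False, "op-3": True}
stale = find_stale_running(engine, backend)  # ["op-1"]
```

Jobs missing from `backend_done` are treated as not done, so only confirmed mismatches are flagged.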