connect timed out

esalinas Broad Member, Broadie

Today I submitted a single data entity (a pair) and got a "connect timed out"
message. I immediately resubmitted, and the error did not occur the second time.
Since the resubmission went through without a timeout, the issue appears to be
transient.
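
The pattern described here (a submission times out, an immediate resubmission succeeds) is the typical case for a client-side retry with backoff. The following is a minimal sketch only, not part of FireCloud or Cromwell: `TransientTimeout`, `submit_with_retry`, and the `submit` callable are hypothetical names standing in for whatever call performs the submission and whatever exception it raises on a timeout.

```python
import random
import time


class TransientTimeout(Exception):
    """Hypothetical stand-in for a "connect timed out" style failure."""


def submit_with_retry(submit, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call submit() until it succeeds or max_attempts is reached.

    Between attempts, wait base_delay * 2**(attempt - 1) seconds plus up to
    10% random jitter, so that many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except TransientTimeout:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last timeout
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

A retry like this is only appropriate when a duplicate submission is harmless, because a request that timed out on the client side may still have been processed by the server.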

Issue · GitHub
by Geraldine_VdAuwera

Issue Number: 1611
State: closed
Closed By: vdauwera

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Thanks for reporting this; we're now keeping track of timeouts and will add this one to the tally. If you get other "connection timed out" failures, please post a comment in this same thread.

  • birger Member, Broadie, CGA-mod

    I just encountered another instance of this: a "connect timed out" message. This time I was launching a workflow on a set of two pairs. One workflow launched successfully; the second reported "connect timed out".

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Noted, thanks.

  • birger Member, Broadie, CGA-mod

    Just encountered 9 additional instances of connect timeouts.

  • birger Member, Broadie, CGA-mod

    And, in a separate workspace, two instances of read timeout and 8 instances of "connect timed out".

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Yikes, that's a lot of timeouts. Can you tell me if you get as many timeouts in the next few days? We'll want to know if this morning's reboot reduces their occurrence or not.

  • birger Member, Broadie, CGA-mod

    Yesterday we submitted an analysis that launched 1000 workflows. Currently, we see 985 successful completions, 12 failures and 3 still running. The workflows in the running state are probably "stuck", but I haven't looked at them closely yet. Of the 12 failures, we have six "connect timed out", one "POST request to /api/workflows/v1/batch timed out" and five other failure conditions that I have not looked into yet. I've attached a screenshot of the failed workflows.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Noted, thank you. We'll see if the team can do anything about these timeouts.

  • birger Member, Broadie, CGA-mod

    Another case of timeouts. We submitted another analysis that launched 1000 workflows. The report shows 986 succeeded, 10 failed and 4 still running. Of the 10 failures, 9 were reported as timeouts. The other failure (one that was assigned a workflow ID) is a case where the workflow never got started: no reported calls, and no directory on the bucket corresponding to the workflow.

    Of the four workflows still reported as running, three each have a mutect1 scatter job stuck in the "Running" state, but the job status from "gcloud alpha genomics operations describe" reports those jobs as done. So Cromwell is not correctly tracking the state of jobs. The fourth workflow reported as running is another case of the workflow never getting started: no reported calls.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie
    Thanks for reporting this. The Cromwell team is working on some fixes that should improve reliability and reduce the occurrence of such issues.
  • birger Member, Broadie, CGA-mod

    Thank you. Any idea when we can expect to see those improvements in FireCloud?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Frankly, it's going to be a while -- I would be surprised if it happened before the end of the quarter. Once the Cromwell team is done making improvements (which will likely be a few more weeks), the Workbench team then needs to get that new version of Cromwell into FireCloud, then the whole thing needs to go through QA for testing before public release. Typically that process (post-Cromwell release) takes about a month.

  • birger Member, Broadie, CGA-mod

    OK...I'll let the team know we will need to live with these timeouts through the end of this quarter, but we can expect them to go away at the beginning of April.

    Are you still tracking occurrences of timeouts? Do you want us to continue reporting them?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie
    Thanks, I think we've got enough information at this point and don't need any further reports, unless you see something that produces new/different symptoms.
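
The failure breakdowns reported in this thread (for example six "connect timed out", one "POST request to /api/workflows/v1/batch timed out", and five other failures out of 12) were tallied by hand. A rough sketch of how such a tally could be automated from a list of failure messages follows. The category names, the `classify` and `tally` helpers, and the "read timed out" wording are assumptions made for illustration; only the two quoted messages come from the posts above.

```python
from collections import Counter

# (category, substring) pairs, checked in order. The first two substrings are
# specific; the last catches any other message that mentions a timeout.
# "read timed out" is an assumed wording, not quoted in the thread.
CATEGORIES = [
    ("connect timeout", "connect timed out"),
    ("read timeout", "read timed out"),
    ("other timeout", "timed out"),
]


def classify(message):
    """Return the first matching category, or "other" if none matches."""
    text = message.lower()
    for category, needle in CATEGORIES:
        if needle in text:
            return category
    return "other"


def tally(messages):
    """Count failure messages per category."""
    return Counter(classify(m) for m in messages)
```

With the messages from a submission collected into a list, `tally(messages)` returns a `Counter` keyed by category, which makes it easy to compare the mix of timeouts from one run to the next.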