Issues with accessing notebook cluster

slee Member, Broadie, Dev
edited June 2018 in Ask the FireCloud Team

I have been getting a lot of these messages today when trying to access my notebook cluster (broad-dsde-firecloud/validation):

The server was not able to produce a timely response to your request.
Please try again in a short while!

Stopping and restarting the cluster is the only thing that seems to resolve the issue, but after a short time (~30 minutes to an hour) I lose access again. Refreshing my FireCloud browser and reopening the cluster (which seemed to help with similar issues in the past) doesn't do anything.

Answers

  • rtitle Member, Broadie, Moderator, Dev

    This error usually means an internal timeout occurred and the server aborted the request. I'm looking at the logs to investigate the root cause and will update here when I know more.

  • rtitle Member, Broadie, Moderator, Dev

    Hi slee,

    I dug a little more and have a couple of questions:

    1. The cluster is stopped right now, but I still see requests happening every 2 minutes like this:

    PUT /notebooks/broad-firecloud-dsde/validation/api/contents/cnv-validation-WXS-optimize-penalty-factor.ipynb

    It looks like something is trying to upload a notebook every 2 minutes, and it's failing each time because the cluster is stopped. I don't think it's harming anything, but I'm curious what would be calling this -- is it some script?

    2. I see the timeout error in the logs; it looks like it's in response to loading the above notebook. We have a proxy in front of Jupyter, and it looks like the proxy is hitting a 1-minute server timeout and returning that "Please try again in a short while" message. So I'd like to investigate the Jupyter server logs to see if something is failing there. If you start your cluster again, can you let me know so I can dig more?
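
    If it comes back, one thing worth trying from a terminal on the cluster is timing a request against Jupyter directly, bypassing the proxy, to see whether Jupyter itself is the slow part. A rough sketch (the local port and path are guesses at where Jupyter listens; adjust as needed):

    # Hypothetical probe: hit the Jupyter contents API directly and time it,
    # to separate "Jupyter is slow" from "the proxy cut us off at 60 s".
    # localhost:8000 is a placeholder for wherever Jupyter listens locally.
    import time
    import requests

    url = "http://localhost:8000/api/contents/"
    start = time.monotonic()
    resp = requests.get(url, timeout=120)
    print(f"status={resp.status_code} elapsed={time.monotonic() - start:.1f}s")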

    Thanks!

    Rob

  • slee Member, Broadie, Dev

    I've restarted the cluster; hope that helps you diagnose. I'm not sure why that request is happening every two minutes, but I did still have that notebook open in a window after I paused the cluster, so could they be autosave attempts?

  • rtitle Member, Broadie, Moderator, Dev

    Oh yeah, autosave makes sense. Thanks, I'll take a look.
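
    Classic Jupyter autosaves open notebooks every 120 seconds by default, which lines up with the 2-minute cadence of those PUTs. If you want to silence them while a cluster is paused, something like this in a notebook cell should work (a sketch; the interval values are just examples):

    # Disable autosave for this notebook (classic Jupyter cell magic);
    # the stray PUTs should stop once autosave is off.
    %autosave 0
    # Or stretch the interval instead of disabling it entirely:
    # %autosave 300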

  • slee Member, Broadie, Dev

    Were you able to diagnose, @rtitle? Let me know if I can pause the cluster. Incidentally, I just started getting those messages again a few minutes ago.

    More generally, I'm having trouble completing some relatively long-running computations (30+ minutes) in my notebook before I seem to lose my connection or authentication, which kills the notebook. As I mentioned above, refreshing the FireCloud browser window or reopening the notebooks tab sometimes brings the connection back, but I'm not sure if this is intended behavior. Any tips?
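
    In the meantime, I may start checkpointing intermediate results so that a dead kernel doesn't cost me the whole run; roughly this pattern (the names and the work itself are stand-ins, not my actual code):

    # Rough checkpointing pattern: persist results after every unit of work
    # so a dead kernel or dropped connection only loses the current item.
    import os
    import pickle

    CHECKPOINT = "checkpoint.pkl"  # illustrative local path

    def process(item):
        return item ** 2  # placeholder for one unit of real work

    results = {}
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            results = pickle.load(f)  # resume from the last save

    for item in range(100):
        if item in results:
            continue  # finished before the last disconnect; skip it
        results[item] = process(item)
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(results, f)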

  • slee Member, Broadie, Dev

    Yup, my kernel died in the middle of a computation, so I decided to bump the memory up. Hopefully the new cluster will indeed be better behaved. Thanks for keeping an eye on things!

  • slee Member, Broadie, Dev

    Looks like I'm running into the same issues with the new cluster.

  • rtitle Member, Broadie, Moderator, Dev

    Hi @slee, sorry for the delay on this. I see the cluster is stopped -- if you're around tomorrow, can you start it again and post here (or Slack me at @rtitle) when you do?

    I'm also a bit curious about the analysis you're trying to run. You mentioned long-running jobs -- are these Spark/Hail jobs? If so, you might benefit from adding worker nodes (right now it's a single-node cluster). If you're not using Spark/Hail but the task is CPU-intensive, bumping from n1-highmem-4 to n1-highmem-8 might be an option. Basically, I'm wondering if your computation is chewing up so many resources that the Jupyter server starts timing out.
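
    If you want to test that theory from inside the notebook while the job runs, something like this would show whether CPU or memory is pegged (assuming psutil is available on the cluster image; pip install it if not):

    # Sample CPU and memory from inside the notebook to see whether the
    # computation is saturating the machine. Assumes psutil is installed.
    import psutil

    print(f"CPU:    {psutil.cpu_percent(interval=1):.0f}% across {psutil.cpu_count()} cores")
    mem = psutil.virtual_memory()
    print(f"Memory: {mem.percent:.0f}% of {mem.total / 2**30:.1f} GiB in use")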

  • slee Member, Broadie, Dev

    The cluster is up at the moment.

    I'm running some postprocessing code on results generated with the GATK CNV pipeline to create plots, etc. This involves grabbing many results from various buckets, so it takes a while (~30 minutes). I haven't encountered the message again in the past few days, so I'm not sure whether the original issue was the result of limited CPU/memory.
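
    For what it's worth, the slow part is mostly fetching objects one at a time. If it turns out to matter, I may parallelize it with a thread pool, along these lines (bucket and object names are made up, and this assumes the google-cloud-storage client is installed):

    # Sketch: fetch many GCS objects concurrently instead of one at a time.
    # Bucket and object names are illustrative, not the real CNV outputs.
    from concurrent.futures import ThreadPoolExecutor
    from google.cloud import storage

    client = storage.Client()

    def fetch(bucket_name, blob_name):
        return client.bucket(bucket_name).blob(blob_name).download_as_bytes()

    targets = [("example-bucket", f"results/sample_{i}.tsv") for i in range(20)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        payloads = list(pool.map(lambda t: fetch(*t), targets))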

  • slee Member, Broadie, Dev

    Hi @rtitle,

    I've finished my computations for the time being, so I went ahead and paused the cluster. I still haven't encountered the message again.
