Error: Not enough resources available to fulfill request

I am just starting to use FireCloud for data analysis. We are currently on the free credit program and want to run our first analysis. Our workspace name is “fccredits-iron-jade-XX31/XXXXX_free”. After launching the method from the available configuration “cellranger_mkfastq_count”, both my colleague and I run into an error right after job submission. The error message is below:

Call #1 (Subworkflow ID 2fd28187-015c-4adc-ae4e-4aa450581a1d):
Started: January 17, 2019, 9:48 AM (57 minutes ago)
Ended: January 17, 2019, 9:50 AM (55 minutes ago)
Failures:
message: Workflow failed
causedBy:
message: Task cellranger_mkfastq.run_cellranger_mkfastq:NA:1 failed. The job was stopped before the command finished. PAPI error code 2. The zone 'projects/fccredits-iron-jade-XX31/zones/us-central1-f' does not have enough resources available to fulfill the request. '(resource type:compute)'.

According to Stack Overflow, this points to a genuine shortage of compute resources on Google's side. Do you think this could be the problem? Have you encountered similar scenarios in the past?

Thank you!
Best, Niklas

Answers

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    Hi @nrindtor

    I am moving this to the FireCloud forum, and @SChaluvadi will help you out with it.

  • tfarrell Member

    Hi @SChaluvadi / @bhanuGandham,

    I also hit the same error on a couple of our job submissions today (under our FireCloud project 'projects/broad-malaria-firecloud/zones/us-central1-f'). I'm also not sure whether this is a Google or a FireCloud issue.

    Best,
    Tim

  • SChaluvadi Member, Broadie, Moderator admin

    @tfarrell and @nrindtor
    Thanks for reporting. I'll check to see what might be going on. @tfarrell, were you able to work around the error or get your workflows to run eventually?

  • SChaluvadi Member, Broadie, Moderator admin

    @nrindtor After some digging, and based on Tim's reply, it looks like this might be a transient error on the Google PAPI end. If you are able, can you try running your workflow again? The common factor that tipped me off is the zone, us-central1-f. Google recommends either retrying or spreading your workload across multiple zones to reduce the impact of such issues. Please let me know if this continues to be an issue!

  • tfarrell Member

    Hi @SChaluvadi,

    Sorry for the delayed reply. I've restarted the jobs, and they appear to be running as expected now. Thanks for your help!

    Best,
    Tim

  • tfarrell Member

    Hi @SChaluvadi,

    I've seen this error crop up again on subsequent jobs, which I've had to rerun a couple of times now. Are there any recommendations for how to prevent this in future submissions? Or any ideas as to why this might be happening in the first place?

    One of the major motivations for doing these analyses in the cloud is that it is extremely elastic with respect to compute, so this is the one error I wouldn't expect to get, and I find it hard to believe that it comes down to limitations in Google's infrastructure.

    Best,
    Tim

  • SChaluvadi Member, Broadie, Moderator admin

    @tfarrell I definitely understand the roadblocks this issue presents. May I pass the error you posted above along to our team to see if they have any suggestions or ideas on why this might be happening so consistently?

  • tfarrell Member

    Yes, it's the same error as the original post:

    "The job was stopped before the command finished. PAPI error code 2. The zone 'projects/broad-malaria-firecloud/zones/us-central1-f' does not have enough resources available to fulfill the request. '(resource type:compute)'."

    I know you mentioned that Google recommends spreading compute over a number of different zones (which makes sense), but I don't believe I have the ability to adjust this; it seems to be one of the backend settings that FireCloud configures and does not expose to the user. I've tried going through Google Cloud Platform to see how to adjust the zones to which our jobs are submitted, but it appears I don't have adequate permissions in the project created by FireCloud.

  • SChaluvadi Member, Broadie, Moderator admin

    @tfarrell You should be able to adjust the zones in your WDL by setting the zones attribute in each task's runtime block. Can you confirm whether your failed workflows run successfully when you retry them after waiting some time?
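
    A minimal sketch of where that attribute goes, assuming WDL 1.0 syntax and Cromwell's zones runtime attribute on the Google Pipelines API backend. The task name, command, Docker image, and zone values are hypothetical placeholders; as I understand Cromwell's documentation, the zones are given as one space-separated string and treated as an ordered preference list.

```wdl
version 1.0

# Hypothetical task: name, command, and Docker image are placeholders.
task example_task {
  command <<<
    echo "running"
  >>>

  runtime {
    docker: "ubuntu:18.04"
    # Candidate zones as one space-separated string (no commas).
    zones: "us-central1-a us-central1-f"
  }
}
```

    Each task carries its own runtime block, so the zones line has to be added to every task that should be allowed to run outside the default zone.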

  • orr Member

    @SChaluvadi

    We have repeatedly been getting the following error for the last 2 days and are unable to run our WDL pipelines:
    `Task scCloud.cluster:NA:1 failed. The job was stopped before the command finished. PAPI error code 2. The zone 'projects/pilot-htapp/zones/us-central1-c' does not have enough resources available to fulfill the request. '(resource type:compute)'.`

    Based on Stack Overflow, I think this problem is on the Google Cloud end. I can't post the link because I'm new to this forum, but the suggestion there is to use multiple zones somehow.

    The WDL was written by Bo Li and Josh Gould and is named scCloud. It would not be ideal to have to keep changing the zone in the WDL; I want the WDL to stay unchanged across my project.

    My workspace contains human data, and there are access restrictions for this project. I don't think it's a data-specific issue, though, as other people in the Regev lab have seen the same error with their datasets. Workflows fail consistently: we run one, it fails, we try another run, and it fails again. I have tried 5 times over the past 2 days.

    Thank you for your help; this is obviously very important to us, and we appreciate the work you're doing to address it. Best, Orr

  • bigbadbo Member, Broadie

    @SChaluvadi

    I have two questions related to switching zones.

    Below is an example for setting the zones:

    runtime {
      zones: "us-central1-c us-central1-b"
    }

    My questions are:

    1) I noticed (on the FireCloud forum) that the syntax might have changed. If I have two zones set, should I use zones: "us-central1-c us-central1-b" or zones: ["us-central1-c", "us-central1-b"]?

    2) If I set two zones, which zone will FireCloud choose?

  • SChaluvadi Member, Broadie, Moderator admin

    Hi @bigbadbo - based on the documentation for zones in runtime blocks, the first form you listed is the correct one: separate the zones with spaces, not commas. As for which zone is chosen, I am not positive, but it looks like Google attempts to use the first zone and then moves to the second if it needs more quota, as seen in this Google document.

  • francois_a Member, Broadie ✭✭

    I'm getting this error as well, running workflows that I've frequently run without problems in the past.

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @bigbadbo and @francois_a we are asking the developers for some clarification on this technical issue.

  • bigbadbo Member, Broadie

    @SChaluvadi, thanks a lot for your response.

    One more question, do you think we can get rid of the resource problem if I list multiple zones?

    Thanks.

  • SChaluvadi Member, Broadie, Moderator admin

    @bigbadbo We are in contact with Google to learn more about this error, but according to their documentation, assigning multiple zones helps keep a workflow from aborting when one zone does not have enough resources. Until we get more information, I think the best course of action is to add a few zones that offer the features your workflow requires.

  • SChaluvadi Member, Broadie, Moderator admin

    @orr @bigbadbo @tfarrell @nrindtor @francois_a
    Update on Resource Failure Issue:
    Google is aware of the elevated error rates being observed in the us-central1-f zone. Their team is working on the issue and hopes to have a full fix rolled out in the next few days. In the meantime, they have provided some workarounds. For help migrating instances across zones, these docs [1] [2] can serve as a reference. I will let you all know of any updates as I receive them. Sorry for the inconvenience!

  • bigbadbo Member, Broadie

    @SChaluvadi ,

    I have changed my WDLs to include multiple zones, and the workflow finished successfully.

    However, I do not know where to find which region and zone this job used. Do you know if the region/zone information is available in the output logs?

    Thanks,
    Bo

  • SChaluvadi Member, Broadie, Moderator admin

    @bigbadbo I am checking on this for you - will reply back with an update.

  • SChaluvadi Member, Broadie, Moderator admin

    @bigbadbo Here are some screenshots and steps to determine which zone was ultimately used when you ran your workflow.

    Step 1: Click on the operation ID (it starts with "operations/....").

    Step 2: Scroll to the bottom of the pop-up and check for the key "zone"

    Hope this helps!

  • bigbadbo Member, Broadie

    @SChaluvadi,

    Now my WDL has a list of 5 zones. However, I still encounter the same error: "Task scCloud.cluster:NA:1 failed. The job was stopped before the command finished. PAPI error code 2. The zone 'projects/manton-ica-1m/zones/us-central1-c' does not have enough resources available to fulfill the request. '(resource type:compute)'."

    Please let me know what you think.

    Thanks,
    Bo

  • SChaluvadi Member, Broadie, Moderator admin
    edited February 1

    @bigbadbo I have relayed this updated information to the team and will get back to you as soon as I hear back.

    Could you list the 5 zones that you have added to your WDL? Even just the code copied into this post would be great!

  • SChaluvadi Member, Broadie, Moderator admin
    edited February 1

    @bigbadbo Can you narrow your zones down to us-central1-a, b, c, and f and retry? We think the supply situation in c is still an issue, and b has been seeing issues lately too. F seems to be recovering from its "downtime". If you keep hitting the error, the team suggests trying a and f, as they seem to be the healthiest at the moment.

    It is also possible that the amount of compute resources being requested is much higher than what's available in a specific zone. If you are able to share your workspace with the G[email protected] user, we can take a closer look. One way to check this yourself is to note whether the same task fails repeatedly with the same inputs.
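
    As a sketch, the narrowed-down list suggested above would look like this in a task's runtime block. Only the zones line is shown (other attributes such as docker stay as they are), and the a/f fallback is commented out. This assumes Cromwell's space-separated zones syntax discussed earlier in the thread.

```wdl
runtime {
  # The four us-central1 zones suggested above, space-separated, no commas.
  zones: "us-central1-a us-central1-b us-central1-c us-central1-f"
  # Fallback if the error persists: the two zones reported as healthiest.
  # zones: "us-central1-a us-central1-f"
}
```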

  • francois_a Member, Broadie ✭✭

    I'm still running into this issue (my WDLs don't specify a zone). Why does this result in task failures rather than queuing?

  • jnoms Member
    edited February 27

    @SChaluvadi I'm still having this problem. It seems that a new zone acts up every day, and it's pretty tedious to have to keep resubmitting jobs in different zones. The latest failure was in us-central1-f, before that it was us-central1-c, and so on. Is there a solution in the works, or is there anything I can do to resolve this? Or is there someone I can contact about this?

  • tfarrell Member
    edited February 27

    Hi all,

    I've changed the zones runtime parameter for all my pipelines and their tasks to "us-east4-a us-east4-b us-east4-c", which seems to have resolved all these issues without any drop in performance (as far as I can tell).

    Best,
    Tim
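
    To avoid hard-coding the zone string in every task (the concern raised earlier about not wanting to keep editing the WDL), one option is to expose it as a single workflow input that each task passes through to its runtime block, so it can be changed in the method configuration instead. A minimal sketch, assuming WDL 1.0 and Cromwell's zones runtime attribute; the workflow, task, and input names, the command, and the Docker image are hypothetical.

```wdl
version 1.0

# Hypothetical workflow: names, command, and Docker image are placeholders.
workflow zone_example {
  input {
    # Single place to set zones for every task; override this input in the
    # method configuration instead of editing the WDL.
    String zones = "us-east4-a us-east4-b us-east4-c"
  }

  call say_hello { input: zones = zones }
}

task say_hello {
  input {
    String zones
  }

  command <<<
    echo "hello"
  >>>

  runtime {
    docker: "ubuntu:18.04"
    # Runtime attributes can reference task inputs.
    zones: zones
  }
}
```

    With this layout, switching every task to a different set of zones is a one-field change in the configuration rather than an edit to each task.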

  • SChaluvadi Member, Broadie, Moderator admin

    @jnoms
    As mentioned above, can you try the zones listed there and see whether you still get the same error?
