Quota on GPUs per billing account leads to job failures rather than queueing

amaro · Broad Institute · Member, Broadie

Hi FireCloud team,

I am running a GPU task on ~1300 samples and have started running into this error: The job was stopped before the command finished. PAPI error code 2. failed to insert instance: googleapi: Error 403: Quota 'NVIDIA_K80_GPUS' exceeded. Limit: 64.0 in region us-central1., quotaExceeded

1) Can I get a higher limit on my billing project?
2) I believe this is a bug: FireCloud should wait for resources to become available rather than failing.

My current workaround is to use Dalmatian to monitor the number of active jobs and submit a new job whenever the number of running jobs drops below 64, which means I am effectively doing the load management myself.
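
Roughly, the loop looks like the sketch below. The helpers count_running_jobs and submit_sample_job are hypothetical placeholders for the actual Dalmatian/FISS calls, and it assumes each job requests a single K80:

```python
import time

# Placeholders for the actual Dalmatian/FISS calls; the real API differs.
def count_running_jobs(workspace):
    raise NotImplementedError("query the workspace's running submissions")

def submit_sample_job(workspace, sample_id):
    raise NotImplementedError("launch the GPU task for one sample")

GPU_QUOTA = 64      # NVIDIA_K80_GPUS limit in us-central1
POLL_SECONDS = 300  # how often to re-check running jobs

def throttle_submissions(workspace, sample_ids):
    """Keep at most GPU_QUOTA jobs in flight, assuming one GPU per job."""
    pending = list(sample_ids)
    while pending:
        headroom = GPU_QUOTA - count_running_jobs(workspace)
        # Only submit as many new jobs as there is GPU headroom for.
        for _ in range(max(0, headroom)):
            if not pending:
                break
            submit_sample_job(workspace, pending.pop(0))
        time.sleep(POLL_SECONDS)
```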

Any advice?
Thanks!

Answers

  • KateN · Cambridge, MA · Member, Broadie, Moderator, Admin

    Hi @amaro,

    To answer your first question: yes, you can get a higher quota limit. Read this doc to learn how. We haven't yet updated the document to cover GPUs, since they are a recent addition, but it will be updated shortly. You should email the address specified and mention your GPU quota requirements.
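
    If it helps, you can also check your current usage against the regional GPU quota programmatically. The snippet below is only a sketch: it assumes the google-api-python-client package and Application Default Credentials, and "my-billing-project" is a placeholder for the Google project behind your billing account.

    ```python
    from googleapiclient import discovery

    # Sketch only: assumes Application Default Credentials are configured;
    # "my-billing-project" stands in for the project backing the billing account.
    compute = discovery.build("compute", "v1")
    region = compute.regions().get(project="my-billing-project",
                                   region="us-central1").execute()

    # Each quota entry reports a metric name, the current usage, and the limit.
    for quota in region.get("quotas", []):
        if "GPUS" in quota["metric"]:
            print(quota["metric"], quota["usage"], "/", quota["limit"])
    ```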

    Regarding your second question, I am looking into it for you. I'm not sure whether failing on quota limits is the expected behavior, so I will get back to you on that soon.

  • amaro · Broad Institute · Member, Broadie

    Hi Kate,

    OK, I emailed [email protected]. If this isn't a bug, can we add it as a feature request? It seems more logical for FireCloud to do the load management for users, since it handles job distribution anyway.

  • kshakir · Broadie, Dev

    Hi @amaro -

    Could you share the workspace with the failed job with [email protected]? I'll take a look at the logs.

    Thanks!

  • amaro · Broad Institute · Member, Broadie

    We are currently discussing this on Slack.

  • Ruchi · Member, Broadie, Moderator, Dev, Admin

    Hey @amaro,

    I see two questions being asked here:
    1. Why aren't GPU jobs queued to wait for quota instead of failing?
    This is a bug in Pipelines API v1, but queueing has been confirmed to work with Pipelines API v2.
    Today, FireCloud uses the v1 backend; the v2 backend will be enabled in FireCloud in the near future.
    2. How can the GPU quota be raised?
    This seems like a band-aid for issue #1. If you plan on using GPUs on FireCloud, it may be the only workaround until v2 is available. Otherwise, I can help you set up jobs on v2 on a standalone Cromwell.
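
    For reference, submitting to a standalone Cromwell is just an HTTP POST against its REST API. The sketch below assumes a Cromwell server configured with a PAPI v2 backend is already running on localhost:8000, and the WDL and inputs file names are placeholders:

    ```python
    import requests

    # Sketch only: assumes a standalone Cromwell server (configured with a
    # PAPI v2 backend) is running on localhost:8000. File names are placeholders.
    with open("gpu_task.wdl", "rb") as wdl, open("inputs.json", "rb") as inputs:
        response = requests.post(
            "http://localhost:8000/api/workflows/v1",
            files={"workflowSource": wdl, "workflowInputs": inputs},
        )

    response.raise_for_status()
    print("Submitted workflow:", response.json()["id"])
    ```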

    I'll keep this post updated as we have a more concrete timeline for a v2 migration. Adding @Ilyana_Rosenberg to this thread so she can be aware of this need.

    Thanks!
