
Service notice: Workflows queued due to scaling limitation

Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin
edited May 2018 in FireCloud Announcements

UPDATE: The issue described below has been resolved.

Due to an individual user's submission amounting to a very large number of jobs (~60k), all new workflow submissions are currently being held in the queue (with status QueuedInCromwell). To be clear, as far as we can tell this is NOT a FireCloud malfunction; it appears to be a Google Cloud limitation that we are encountering for the first time. We are working with GCP support and evaluating options to unblock the queue, hopefully without interrupting that one very ambitious and totally legitimate submission. We will strive to resume normal workflow throughput by Monday morning EST.

We understand that this is causing many of you considerable inconvenience, but we are hopeful that this case will give us an opportunity to push the current scaling limits to the next level. Please remember that what we are all doing here, together, is blazing a new trail: building a new model for how we do science at scale, collaboratively. The fact that these scaling problems are arising at all shows that we are on the right path and that the research community needs this level of scalability. And we will do everything in our power to deliver it.

Thank you for your patience and stay tuned for updates.


Comments

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    UPDATE: We were able to increase the number of workflows that can be processed concurrently by the system, so all workflows should now be running (i.e., no longer in the QueuedInCromwell state). This should resolve the symptoms for all users; let us know in this thread if you experience any further issues of the same type.

  • birger · Member, Broadie, CGA-mod ✭✭✭

    What is the upper limit on the size of a submission (i.e., the number of workflows) that can be made without impacting other users (i.e., not getting in the QueuedInCromwell state)? Is the size of the submission dependent on the complexity of the workflow as well as the number of workflows?

    thanks.

  • dvoet · Member ✭✭

    It actually depends on the complexity of the workflow, measured by the number of concurrent calls it creates. The number of projects involved and the job completion rate are also factors. Due to a network limitation (fixed when we go to PAPIv2), the maximum number of VMs we can launch in a project is ~4k (assuming other quotas are not reached first). If each job takes 1 hour, a single project can finish 4k jobs per hour, so a submission that launches 24k concurrent calls will take a minimum of 6 hours. I think that is a good target to shoot for: at most 24k concurrent 1-hour calls per project. However, we can only handle about 3 of those concurrently before hitting the QueuedInCromwell problem again. If you want to spread across more projects for higher throughput, please keep your overall concurrent job count under 60k (a quick sketch of this arithmetic appears after the comments below).

  • birger · Member, Broadie, CGA-mod ✭✭✭

    Thank you Doug.
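
A back-of-envelope sketch of the sizing arithmetic dvoet describes above. The constants come straight from that comment; the helper names are illustrative only and are not part of any FireCloud or Cromwell API, and the sketch assumes every call runs for roughly the same duration.

    # Constants quoted in the comment above; helper names are illustrative.
    VM_CAP_PER_PROJECT = 4_000      # ~4k VMs per project until PAPIv2
    GLOBAL_CONCURRENT_CAP = 60_000  # keep total concurrent jobs below this

    def min_hours(concurrent_calls: int, call_hours: float = 1.0,
                  vm_cap: int = VM_CAP_PER_PROJECT) -> float:
        """Minimum wall-clock hours for equally sized calls in one project."""
        # Calls run in waves of at most vm_cap VMs; each wave takes call_hours.
        waves = -(-concurrent_calls // vm_cap)  # ceiling division
        return waves * call_hours

    def within_global_cap(calls_per_project: int, n_projects: int) -> bool:
        """True if spreading across projects stays under the queueing threshold."""
        return calls_per_project * n_projects < GLOBAL_CONCURRENT_CAP

    print(min_hours(24_000))             # 6.0 hours, matching the comment
    print(within_global_cap(24_000, 2))  # True:  48k concurrent jobs
    print(within_global_cap(24_000, 3))  # False: 72k exceeds the 60k guideline

With these numbers, 24k one-hour calls fit in a single project in about 6 hours, and two such submissions (48k concurrent jobs) stay under the 60k guideline.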
