
Jobs stuck in queued status

mleventhal · Cambridge, MA · Member, Broadie ✭✭

Hello,

Since around midnight last night to this morning, all of the jobs I have launched are stuck in the "queued" state. This extends from jobs that are run on a single sample to jobs run on sample sets of 164 samples, so "batching" my submissions does not seem like a solution. Additionally, I am not under the impression that I am running up against any kind of quota, so I am perplexed as to what could be the issue. Any insight would be appreciated, thank you!

Best,
Matt Leventhal

Answers

  • lelagina · Member, Broadie

    Hello,

    I am experiencing similar issues: jobs have been stuck in the Submitted state for an hour, even though there are no other jobs in the queue.

    I am running on a pair set of 50 pairs.

    Thank you,
    Luda.

  • KateN · Cambridge, MA · Member, Broadie, Moderator, Admin

    Thanks for the report, I'm looking into it.

  • KateN · Cambridge, MA · Member, Broadie, Moderator, Admin

    Both @mleventhal and @lelagina, could you please share your workspaces with [email protected] (Workspace > Share button)? Then please post the following information here:

    Workspace Name
    Submission ID
    Workflow ID

    We are trying to determine why these workspaces are hanging, and the developers would like to examine the metadata further.

  • lelagina · Member, Broadie

    Workspace Name:
    broad-firecloud-wuclonal/Wu_FollicularLyphoma_Data_Clinical_Pipeline
    Submission ID
    3f1f26f5-b4f7-4036-ae38-1abdd22a13dd
    Workflow ID

  • mleventhal · Cambridge, MA · Member, Broadie ✭✭

    Thank you for the prompt response; I have shared the workspaces.

    The two workspaces with hanging jobs are as follows:

    ebert-fc/new_SCB0024
    Submission ID:
    2b670ff0-430c-4b6a-921b-8f50591bc957

    ebert-fc/exac_maf_aggregation
    Submission IDs:
    8dd19abd-7704-427d-bce1-cb46a932594b
    cc87d433-2a03-4821-be9a-927e8adfcff3

    There are no workflow IDs anymore because I aborted the jobs (in one case the job is still aborting). Let me know if I should kick anything off for demonstration purposes.

  • mleventhal · Cambridge, MA · Member, Broadie ✭✭

    My apologies, all but 8dd... lack workflow IDs. Here are some workflow IDs for 8dd...

  • lelagina · Member, Broadie

    Hello Kate,

    Thank you for the quick help. I will monitor the progress.

    Did the developers find a solution only for the workspaces that Matt and I shared with you, or for any workspace? I am seeing the same submission problem in other workspaces.

    Thank you,
    Luda.

  • lelagina · Member, Broadie

    Hello Kate,

    Jobs are running in all my workspaces.

    Thank you,
    Luda.

  • mleventhal · Cambridge, MA · Member, Broadie ✭✭

    Hi Kate,

    My jobs are running too, thank you!

    As a question about the error: would I run into a similar issue if I attempted to run a job on ~7000 samples that runs for about 10 minutes per sample? The task runs on MAFs that are only a few KB to a few MB in size. Would you suspect that this would create an issue with the queue, or would this be unlikely to cause a problem? Thank you!

    Best,
    Matt

  • francois_a · Member, Broadie ✭✭

    Seeing this issue again -- thanks in advance for looking into it.

    Example:
    submission: 0828e925-ecac-4112-a072-9f68d93761db
    workflow: 38ad020d-15d2-4886-9b19-36c91e759a94

  • RobinK · Member

    I am running into the issue as well:
    SubmissionId: 9fcfc72b-3cc8-403c-b259-de505944c6d0

    Thanks!

  • KateN · Cambridge, MA · Member, Broadie, Moderator, Admin

    @mleventhal I am looking into your question, but I don't have an answer for you yet. Just wanted to let you know I haven't forgotten about you.

    @francois_a and @RobinK Beginning last night (5/17) around 10 pm EST, we noticed a large influx of jobs. Cromwell is currently running the maximum number of jobs we allow, which means there is a backlog waiting in the QueuedInCromwell state.

    I'm working to find out more information about how long this delay might be, but for now we ask for your patience.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie, Admin

    Update on this: it seems that we're hitting some limits set by Google due to an individual submission that amounts to a very large number of jobs (~60k). We're working with GCP support to figure out a way forward. See the service notice I just posted on the blog and banner in the FC portal for updates.

  • KateN · Cambridge, MA · Member, Broadie, Moderator, Admin

    @mleventhal Following up on your question from earlier:

    As a question about the error: would I run into a similar issue if I attempted to run a job on ~7000 samples that runs for about 10 minutes per sample? The task runs on MAFs that are only a few KB to a few MB in size. Would you suspect that this would create an issue with the queue, or would this be unlikely to cause a problem? Thank you!

    I don't believe a job run on 7000 samples would cause a queue issue unless each sample were being scattered 50 ways individually. That case would involve 350k concurrent calls, which is more than Google allows us to handle right now. Doug's comment here explains the current limitations on job size fairly well.
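    The sizing math in that answer can be sketched as a quick back-of-the-envelope check (the helper function below is hypothetical, not part of FireCloud or Cromwell; the figures come from the example in this thread):

    ```python
    # Rough estimate of how many concurrent calls a scattered submission
    # generates: each sample run through an N-way scatter yields N calls.

    def concurrent_calls(samples: int, scatter_width: int = 1) -> int:
        """Total calls for `samples` inputs, each scattered `scatter_width` ways."""
        return samples * scatter_width

    print(concurrent_calls(7000))       # 7000: the proposed job, unscattered
    print(concurrent_calls(7000, 50))   # 350000: the worst case described above
    ```

    So the 7000-sample job on its own stays far below the 350k worst case; the scatter width inside the workflow, not the sample count alone, is what drives the total toward the limit.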

  • mleventhal · Cambridge, MA · Member, Broadie ✭✭

    Thank you, this explanation and Doug's comment were very helpful!
