Cryptic failure messages

gauthier Member, Broadie, Moderator, Dev admin
edited June 2017 in Ask the FireCloud Team

Hi there,

I have a method config that worked swimmingly with a single sample, but is failing like crazy on my ~7k sample set. The top-level error message is "Unexpected failure in EJEA." Is that something I can fix or retry, or is it more likely on the Google side? Lots of my jobs spent eons waiting for quota. The only one that finished was the single sample that was call-cached. More error details below.

Thanks!
Laura

From the workspace broad-ccdg-dev/Delly_Kathiresan_VIRGO_MESA_TAICHI_Hg19bams

Workflow ID:b5bf24a2-cfca-45ea-bae4-55daf65afe51

message: Unexpected failure in EJEA.
causedBy: 
message: java.util.concurrent.RejectedExecutionException: Task [email protected] rejected from [email protected][Running, pool size = 200, active threads = 196, queued tasks = 1000, completed tasks = 537467]
causedBy: 
message: Task [email protected] rejected from [email protected][Running, pool size = 200, active threads = 196, queued tasks = 1000, completed tasks = 537467]

Answers

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭
    edited June 2017

    Hey guys,

    I've been looking this over with Laura and it's not looking great. So far we are seeing 1830 failures and 0 successfully completed workflows (minus the initial one that was call-cached). The failure messages encompass a whole range of reasons; see below for a sampling. Do you guys know what's going on? Is there a single over-arching cause of these failures?

    Below are a few of the workflow IDs and their corresponding failure messages. The first two are fairly common, while the third I don't see often.

    545d73b0-63fe-492c-9387-c9e4508a1eff
    message: The WorkflowDockerLookupActor has failed. Subsequent docker tags for this workflow will not be resolved.
    
    b5bf24a2-cfca-45ea-bae4-55daf65afe51
    message: Unexpected failure in EJEA.
    causedBy: 
    message: java.util.concurrent.RejectedExecutionException: Task [email protected] rejected from [email protected][Running, pool size = 200, active threads = 196, queued tasks = 1000, completed tasks = 537467]
    causedBy: 
    message: Task [email protected] rejected from [email protected][Running, pool size = 200, active threads = 196, queued tasks = 1000, completed tasks = 537467]
    
    70a9248e-3e8d-4cb4-bf13-592d5b58a994
    message: Task DellyGermlineGenomeWorkflow.DellyDEL:5:1 failed. JES error code 10. Message: 13: VM ggp-15173193156103938930 shut down unexpectedly.
    
  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I'll call in Cromwell reinforcements for the "Unexpected failure in EJEA" message, since that's one of their cryptic errors we've seen pop up recently.

    That being said, the "VM shut down unexpectedly" error sounds more like either a Google problem or whatever you were running crashing the machine badly... Have you tried running that sample by itself?

  • gauthier Member, Broadie, Moderator, Dev admin

    The sample by itself just about flies through with no problems. :-/

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Ech. Does that apply to all three of the error modes reported here?

  • gauthier Member, Broadie, Moderator, Dev admin

    I don't know what you mean. All of those error modes came from the same big 7600-sample submission, and I never saw any of those errors with a single-sample submission.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I'm trying to triage whether these are likely to be scaling errors vs. issues that are due to something wrong with individual samples. Sounds like the former, potentially.

    Btw the delays may not be due to quota, but to submission delays caused by a Google bug as noted here.

  • mcovarr Cambridge, MA Member, Broadie, Dev ✭✭

    What's the Cromwell version? We've put in lots of optimizations lately to reduce errors that look like

    Task [email protected] rejected
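
    For context, that message is what a java.util.concurrent.ThreadPoolExecutor with a bounded work queue produces once the queue fills up. Here's a minimal, illustrative Java sketch - not Cromwell's actual code, and with the 200-thread / 1000-slot sizes from the error scaled way down - that triggers the same RejectedExecutionException:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class RejectionDemo {
        public static void main(String[] args) {
            // Bounded pool (2 threads) and bounded queue (capacity 10); the default
            // AbortPolicy rejection handler throws once both are saturated.
            ThreadPoolExecutor executor = new ThreadPoolExecutor(
                    2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(10));

            Runnable slowTask = () -> {
                try { Thread.sleep(1_000); } catch (InterruptedException ignored) { }
            };

            try {
                // Submissions beyond (busy threads + queue capacity) are rejected,
                // producing the same "Task ... rejected from ..." shape seen above.
                for (int i = 0; i < 100; i++) {
                    executor.execute(slowTask);
                }
            } catch (RejectedExecutionException e) {
                System.out.println("Rejected: " + e);
            } finally {
                executor.shutdownNow();
            }
        }
    }

    In Cromwell's case the saturated executor fronts database work, so the practical levers are fewer simultaneous submissions or more queue headroom.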
    
  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    It's C27 in FireCloud (I assume this is in FC, right @gauthier ?)

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    Yes, in FireCloud

  • gauthier Member, Broadie, Moderator, Dev admin

    I hit that PAPI bug with a previous submission of the same workflow/samples, so we aborted and started over here. This time I've had seemingly better throughput of workflows.

    Bob H. and Eric have run a couple of tools on these samples already without too much trouble, so I don't think it's the samples.

  • mcovarr Cambridge, MA Member, Broadie, Dev ✭✭

    I suspect an overflowed queue of database work underlies the first two failures, while the third looks more like JES / GCE flakiness. Unfortunately Cromwell 27 should already have all of our optimizations to reduce database queue overflow.

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    @mcovarr what should we do as users here? Wait until everything is done and then relaunch?
    Also, did any of these failed jobs actually incur cost?

  • abaumann Broad DSDE Member, Broadie ✭✭✭

    would you mind giving reader access to [email protected] so I can take a look and see if I can gather any other info to understand the cause?

  • mcovarr Cambridge, MA Member, Broadie, Dev ✭✭

    @ebanks Those particular database errors really have no context, so unfortunately it's not clear whether cost was incurred. These are really nasty errors and I had hoped we were past them with our Cromwell 27 improvements, but it seems like we have more work to do. :(

  • gauthier Member, Broadie, Moderator, Dev admin

    @abaumann I'm not an owner, but Namrata just added GROUP_support to the workspace. Let me know if you need anything else.

  • gauthier Member, Broadie, Moderator, Dev admin

    Here's one that looks like what @mcovarr was talking about, from Workflow ID:f5d37e7e-4c38-431f-b377-7b9e38e5786b:

    message: Task [email protected] rejected from [email protected][Running, pool size = 200, active threads = 200, queued tasks = 1000, completed tasks = 1543924]

    "completed tasks = 1543924" sounds so positive! And yet... failure.

  • abaumann Broad DSDE Member, Broadie ✭✭✭

    Poking around more myself, I can't see anything more helpful either. I think the best bet for these is to relaunch and try again on the failed workflows. If they fail again in the same way, that will at least give us something more concrete to debug.

    Given this error it's hard to tell if a VM was run or not - the only way to tell would be to look for any GGP VMs spun up in this project that were not recorded by Cromwell - should I try to do that?

  • gauthier Member, Broadie, Moderator, Dev admin

    When you say "relaunch", you just mean run the same thing again, right? There's no retry button or anything?

  • abaumann Broad DSDE Member, Broadie ✭✭✭

    Yes, rerun the same thing again - we will be adding a feature soon that lets you relaunch directly from the Monitor tab, which will make this easier for you.

  • danb Member, Broadie ✭✭✭

    tl;dr: I suggest that FC set database -> db -> queueSize = 10000 or even 100K and observe.

    We are hitting the queue size limit for our database connection, which is only set to the default value of 1K.

    Because this load is spiky, I think it is worth a shot just giving ourselves a lot more headroom and letting the DB chew through all the activity in a relatively short time period.

    I've filed a bug here.
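
    For reference, here's a minimal sketch of what that override might look like in Cromwell's HOCON configuration. The database -> db -> queueSize path comes straight from the suggestion above; treat the exact key names as an assumption and check them against the reference configuration of the deployed Cromwell version:

    # Sketch only: key path per the suggestion above, not verified against a
    # specific Cromwell release.
    database {
      db {
        # Work queue for Slick's async executor; the 1K default is what the
        # rejected-task errors above are overflowing.
        queueSize = 10000  # or 100000 for even more headroom
      }
    }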

  • gauthier Member, Broadie, Moderator, Dev admin

    I resubmitted. The old workflow is still aborting after ~24 hours, which is surprising. I'm also getting pretty much the same errors on the resubmitted jobs. And the one success that was call-cached is "running" the final tasks again, very slowly -- also surprising.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Hi @gauthier! We recently made some improvements to the backend, and we think they may be able to help you. When you try running this again, do you see improvement? (tasks run faster, workflow doesn't abort)

  • gauthier Member, Broadie, Moderator, Dev admin

    @KateN No apparent improvement. I re-launched the analysis but I'm still getting the EJEA errors like I mentioned before. One in particular:

    message: Task [email protected] rejected from [email protected][Running, pool size = 200, active threads = 200, queued tasks = 1002, completed tasks = 499525]

  • dlivitz Member, Broadie

    I am seeing this error as well in an unrelated workspace/task.

  • KateN Cambridge, MA Member, Broadie, Moderator admin
    edited July 2017

    Our developers believe that this error will be fixed by increasing the Slick batch size. We are working on a fix currently, and it is planned to go out in the next release. (We had one go out today; this will be in the following release.)

  • gauthier Member, Broadie, Moderator, Dev admin

    Thanks, Kate -- I'll keep my fingers crossed!

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    Hi Kate, what's the ETA on that next release? Should we just sit tight until that's ready?

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    I don't have a specific ETA, but I do believe it is planned for early next week.

  • gauthier Member, Broadie, Moderator, Dev admin

    I still had a lot of failures this time around, but most of the workflows succeeded! What's the easiest way to re-run the failures right now? There are at least 1175 of them.

    Thanks,
    L

    P.S. Most of the failures were Slick-related, but I found a couple of new ones (at least, new to me):

    Workflow ID:981b194c-9d48-48d2-b9e1-0e8f2c9b9f35
    message: The WorkflowDockerLookupActor has failed. Subsequent docker tags for this workflow will not be resolved.

    Workflow ID:36a19983-6985-467f-ad47-3353864b63ba
    message: Task DellyGermlineGenomeWorkflow.DellyDUP:6:1 failed. JES error code 10. Message: 13: VM ggp-7666328913874879383 shut down unexpectedly.

  • birger Member, Broadie, CGA-mod ✭✭✭

    I have made a request, through the forum, for a way to easily re-run failed workflows. See https://gatkforums.broadinstitute.org/firecloud/discussion/9647/rerun-workflows-in-failed-or-aborted-state#latest

  • KateN Cambridge, MA Member, Broadie, Moderator admin
    edited July 2017

    Currently the easiest way to re-run is to execute the job again with no changes, and make sure you are using call caching. This will work if your first run with the 1175 failures had call-caching enabled, and your second run has it enabled as well. The second run should find a "hit" for every successfully run job, and simply copy over the outputs, leaving it to re-run the failures.

    It isn't ideal, since this involves a lot of copying work, but @birger's feature request linked above should help soon.

    The next release is planned to go out tomorrow, so I'll post again on here when we have it live, for you to test whether or not it helps.

    Edited to update the day of the release; I've just been told we are not releasing today due to a demo of FireCloud we are giving in NY.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    The release is now live; we would love to hear if it helps with your problem.

  • gauthier Member, Broadie, Moderator, Dev admin

    Given that I have 7500 samples I'm terrified that the call caching won't work, especially since the versions switched over in between. I don't remember explicitly disabling call caching on the first run. Is there a way to "stage" jobs to make sure I won't re-run the 6400 successes? Or if they do get re-run and fail, will the successful output still be in my data model?

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Call caching does work between different versions of Cromwell after version 19, I believe. We are currently on version 28, and based on the time frame of this question, you started well after version 19.

    If you left call caching active (it is a checkbox that is checked by default), then the 6400 successes will be carried over to a call-cached second run. However, if you'd like to stage the job, I'd recommend creating a test set that contains one or two samples you know succeeded and one or two samples you know failed. Then you can check the Monitor tab to see if the test case properly found a call-cached hit for the previously successful samples.

    If call caching wasn't active, and you re-ran all 7500 samples from scratch, and some previously successful samples failed in this second run, the successful output will not be overwritten in your data model. You will still keep that succeeded output.

    Sorry for taking so long to get back to you, but I hope this helps! Please let me know if the failed samples properly run or not with the new update.

  • gauthier Member, Broadie, Moderator, Dev admin

    Everything went great this time! I like your "staging" suggestion -- I'll use that if I ever question the call caching in the future.

    I had a few failures, but they were all non-Cromwell-related.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Glad to hear they aren't Cromwell-related at least! I hope you're able to get the failures fixed.
