JES error code 2. Message: Instance failed to start due to preemption

francois_a Member, Broadie ✭✭

Seeing multiple instances of this (for example, workflow 2bd68c76-d225-4cc8-a226-e5eb28c48474, submission 20b7af64-5a22-4c4b-8b19-e8b03de85880; I've added [email protected] to the workspace). Maybe related to https://gatkforums.broadinstitute.org/firecloud/discussion/10429/failed-jes-error-code-2-message-gaia-unavailable ?

These failures happened on individual scatter jobs. With hundreds of tasks getting terminated as a result, the cost of such errors is non-negligible.

While intermittent, JES errors do occur regularly (mostly code 10). Are any near-term fixes planned? Would it be possible to implement a mechanism that avoids terminating unaffected scatter jobs?

Answers

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    From the error code, it would appear that particular job failed due to being preempted. Were the other tasks that were subsequently terminated dependent on the task that was preempted?

    Thank you for sharing the workspace with us already. Could you let me know the name of the workspace so I can have a developer take a closer look?

  • jtsuji Cambridge Member, Broadie

    We obtained the same error message while running scatter jobs on preemptible VMs (see the attached screenshot). Looking at the JSON file, it appears that the job failed on its first preemptible attempt (preemptible attempts were set to 3 in this run). This error happened many times in tasks that scatter jobs, and rerunning the jobs usually resolves it. We are going to run a workflow on a large dataset and are wondering whether there is a way to avoid the error.

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    @KateN, this doesn't make any sense to me. If a task gets pre-empted more than the number of times specified in the WDL then it's supposed to drop down and use a normal instance rather than fail the whole task.

    I'm seeing a bunch of these errors also and they look like bugs. None of these jobs should be failing.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    The error message you are seeing is due to an unhandled case introduced by Google fairly recently. We have a fix for it, which should go out when we upgrade to Cromwell version 30 sometime next week.

  • jgentry Member, Broadie, Dev ✭✭✭

    Error 10 Message 14 implies a preemption. These are handled by Cromwell via the preemption flag in one's WDL. After trying N times it'll fall back to a standard instance.

    Error 10 Message 13 implies an "unexpected termination" by Google. Cromwell will retry these up to 3 times before giving up.

    These two counts are completely independent.

    As far as I know there's no bug in Cromwell here.
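The two independent counters described above can be sketched in a few lines of Python. This is only an illustration of the behavior as stated in this post, not Cromwell's actual code: the names `RetryState`, `next_action`, and `run` are hypothetical, and the limit of 3 unexpected-termination retries is taken from the post.

```python
from dataclasses import dataclass

# Outcome labels are hypothetical; they stand in for the JES results
# discussed in this thread.
PREEMPTED = "preempted"    # e.g. Error 10 Message 14
UNEXPECTED = "unexpected"  # e.g. Error 10 Message 13
SUCCESS = "success"


@dataclass(frozen=True)
class RetryState:
    preemptible_left: int     # the N preemptible attempts requested in the WDL
    unexpected_left: int = 3  # retries allowed for unexpected terminations


def machine_for(state):
    """Pick the instance type for the next attempt."""
    return "retry_preemptible" if state.preemptible_left > 0 else "retry_standard"


def next_action(state, outcome):
    """Return (action, new_state) after one attempt finishes with `outcome`."""
    if outcome == SUCCESS:
        return "done", state
    if outcome == PREEMPTED:
        # Only a known preemption consumes the preemptible budget.
        left = max(state.preemptible_left - 1, 0)
        new_state = RetryState(left, state.unexpected_left)
        return machine_for(new_state), new_state
    if outcome == UNEXPECTED:
        if state.unexpected_left == 0:
            return "fail", state
        # The preemptible budget is left untouched here.
        new_state = RetryState(state.preemptible_left, state.unexpected_left - 1)
        return machine_for(new_state), new_state
    raise ValueError(f"unknown outcome: {outcome!r}")


def run(outcomes, preemptible=3):
    """Replay a sequence of attempt outcomes and collect the actions taken."""
    state = RetryState(preemptible)
    actions = []
    for outcome in outcomes:
        action, state = next_action(state, outcome)
        actions.append(action)
        if action in ("done", "fail"):
            break
    return actions
```

Under these assumptions, three preemptions in a row end with a retry on a standard instance, while an unexpected termination in between does not reduce the number of preemptible attempts left.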

  • francois_a Member, Broadie ✭✭

    The main problem when "Error 10" occurs is that all other scatter jobs get terminated. Since re-running the error 10 jobs usually works, wouldn't it make sense to leave the other scatter jobs running in this case? The cost implications can be non-negligible.

  • jgentry Member, Broadie, Dev ✭✭✭

    @francois_a "Error 10" isn't particularly meaningful in and of itself. You need the "Message" part.

    Cromwell can be configured to run in one of several modes that determine how it proceeds when it detects a failure, e.g. proceeding as far down the graph as possible vs. stopping as soon as possible. I do not know what FireCloud is configuring Cromwell to do here.
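For readers running Cromwell themselves, the two modes mentioned above correspond, as far as I know, to the `workflow_failure_mode` workflow option, with the values `ContinueWhilePossible` and `NoNewCalls`. The Python snippet below is a minimal sketch that builds such an options document; the helper name `workflow_options` is hypothetical, and whether FireCloud lets users set this option is not established in this thread.

```python
import json


def workflow_options(continue_while_possible):
    """Build a Cromwell workflow-options dict selecting a failure mode.

    Assumption: the option key is `workflow_failure_mode` and the accepted
    values are `ContinueWhilePossible` and `NoNewCalls`.
    """
    mode = "ContinueWhilePossible" if continue_while_possible else "NoNewCalls"
    return {"workflow_failure_mode": mode}


if __name__ == "__main__":
    # Print the JSON that would go into a workflow options file.
    print(json.dumps(workflow_options(True), indent=2))
```

With `ContinueWhilePossible`, scatter shards that do not depend on the failed call would be allowed to finish, which is the behavior requested earlier in this thread.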

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    @jgentry the problem for me is really that the error message is misleading. What the hell does it mean that an instance failed to start due to preemption? Why wouldn't this be one of the failures for which Cromwell automatically retries?

    And even if this is fixed in v30, we should consider supplementing the Google error messages with guidance on what a user should do about it. Retry? Post in forum? Give up?

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    There are two error messages being discussed in this thread: JES Error Code 2 is going to be patched when FireCloud upgrades to Cromwell version 30, which is planned for this week.

    JES Error Code 10, messages 13 and 14 are what @jgentry has been discussing. We are currently working with Google to determine what the expected behavior should be for message 13, which is one that several users have seen issues with.

    @ebanks I will bring up your suggestion to the teams about supplementing error messages with further guidance. That would definitely go a long way with helping people know what to do when encountering errors fixed with a simple retry.

  • jgentry Member, Broadie, Dev ✭✭✭

    @ebanks At no point ever should a Cromwell job ultimately fail due to preemption. One might find a log message to that effect, but that's earlier in the process. When the maximum number of preemptions has been reached we fall back to a standard instance, and that won't be preempted. I don't see any actual Cromwell output in this forum thread so I have no idea what you're directly referring to.

  • jgentry Member, Broadie, Dev ✭✭✭

    @ebanks dug into this a bit more and cross-referenced w/ information from Google. Error 2 (referenced in the subject of this thread, but nowhere else so I can't see it) is a "we have no idea what went wrong" error from Google. At times the specific error message references preemption. Good news: As of a month or so ago we detect that as a preemption. Presumably that's coming soon to a Firecloud near you.

    There's also Error 10 (ABORTED). Message 14 is preemption. Only 10/14 is supposed to imply a known preemption. There's also 10/13 which is "It aborted, but we don't know why" from Google. Sometimes that's also due to a preemption, but so far I don't know of a specific error string to latch on to as a heuristic.

    We can add all of these and more to the "retry as if it were a preemption" list, but the problem there is that makes it that many fewer retry attempts before you're paying for a full instance. With the current logic, if the VM just barfed for unknown reasons we'd retry that w/o decrementing the preemption count (up to a certain number of times, of course). Considering how cost conscious people tend to be, we try to err on the side of giving people more preemption attempts.
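The classification described in this post can be sketched roughly as follows. This is an illustration only: the function name `classify`, the category labels, and the substring check on the Error 2 message are assumptions, not the heuristic Cromwell actually uses.

```python
def classify(code, message_number=None, message_text=""):
    """Map a JES error to a coarse category, following the post above.

    Assumption: an Error 2 is treated as a preemption when its message
    text mentions preemption; the real heuristic may differ.
    """
    if code == 10 and message_number == 14:
        # Error 10 (ABORTED), Message 14: a known preemption.
        return "preemption"
    if code == 2 and "preempt" in message_text.lower():
        # Error 2 is an unknown failure whose message sometimes cites preemption.
        return "preemption"
    if code == 10 and message_number == 13:
        # Aborted for an unknown reason; retried without using up a
        # preemptible attempt.
        return "unexpected_termination"
    return "other"
```

For example, `classify(2, message_text="Instance failed to start due to preemption")` falls into the preemption branch, while `classify(10, 13)` does not, so it would not count against the preemptible attempts.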

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    Yeah, it's the Error 2s that have been killing our workflows. If that's gonna be fixed in FireCloud soon then I'm happy.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Unfortunately, due to a misunderstanding on my part, the version of Cromwell 30 (v. 30.1) to which FireCloud was upgraded yesterday did not include the fix for the Error 2s we've been seeing. That fix is currently in the Cromwell 30 hotfix version, but should be incorporated into a solid version (30.2) soon. I'm waiting to hear back from the devs with a firmer date for when this error will be fixed in FireCloud, and I will update you as soon as I know.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    The patch for this error went out with the latest release this morning. Please let us know if you see any more issues with it.
