JES error code 10. Message: 14: VM ggp-14197713620027335497 stopped unexpectedly.

Chip 415M 4053 Member, Broadie

Hi

Does a failure message that ends with:

JES error code 10. Message: 14: VM ggp-14197713620027335497 stopped unexpectedly.

mean that the VM on which this particular task was running went down and the remedy would be to simply re-run the job?

Answers

  • Tiffany_at_Broad Cambridge, MA Member, Broadie, Moderator admin

    Hi @Chip,

    Sorry for the delayed response! We had a workshop at the end of last week that kept us very busy. I haven't seen this issue before so I am asking a teammate. I will report back shortly!

    Thanks,
    Tiff

  • Chip 415M 4053 Member, Broadie

    Hi Tiffany,

    Is the preemptible failure mode corresponding to

    "JES error code 10. Message: 13: VM ggp-16478816659238753925 shut down unexpectedly."

    a design feature or a bug? If this is by design, should we interpret this as Google telling us to avoid preemptible VMs unless we have the time and patience to re-run jobs many times?

    If this is a bug, is there an ETA for a fix?

    Chip

    P.S. I've been getting hundreds of these failures recently and I'm having to run jobs with preemptible:0 at higher cost.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Both error messages 13 and 14 mean that your preemptible job was preempted. If you have preemptible set to >0, then your job should automatically retry on another preemptible machine until it either succeeds or uses up the number of tries you allowed.

    Are you seeing it fail to retry automatically while preemptible was set to >0? If so, that would be a bug. If it fails only after the allowed number of retries, then that would mean you were simply being preempted by Google.

  • Chip 415M 4053 Member, Broadie

    Thanks for the update! This is causing havoc in projects in which a large fraction of jobs are getting killed.

  • birger Member, Broadie, CGA-mod ✭✭✭

    @KateN, could you give us an estimate of when this bug will be fixed? It is creating a lot of problems in our lab, as we run analyses on large datasets and rely heavily on preemptibles to keep our costs down. Having to identify the large numbers of failed workflows due to this problem and re-run them manually doesn't scale for the large quantity of parallel workflows we run.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    We are currently evaluating the situation, and I will get back to you with specifics and an ETA if possible.

  • jgentry Member, Broadie, Dev ✭✭✭

    @katen I'd like to correct something here. Message 13 does not imply preemption. It's an "unexpected termination" from Google.

    Cromwell will retry message 13 responses a few times but that's a completely different track than preemption, which is triggered by message 14.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Thanks for the clarification @jgentry. My explanation oversimplified.

    @Chip We would like to have someone take a look at the error you are seeing. They will need the project ID and a window of time in which the error occurred so that they can check the logs.
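
For anyone triaging a large batch of these failures, the distinction jgentry draws above (message 14 means preemption, message 13 means an "unexpected termination" that is retried on a separate track) can be checked mechanically. The following is only a minimal Python sketch: the regular expression is inferred from the two failure lines quoted in this thread, and the function name and the returned labels are hypothetical, not part of Cromwell or any Google API.

```python
import re
from typing import Optional

# Pattern inferred from the failure lines quoted in this thread, e.g.
# "JES error code 10. Message: 14: VM ggp-... stopped unexpectedly."
# This is an assumption about the log format, not a documented spec.
JES_PATTERN = re.compile(r"JES error code (\d+)\. Message: (\d+): (.*)")


def classify_jes_failure(line: str) -> Optional[str]:
    """Return a hypothetical label for a JES failure line, or None if no match."""
    match = JES_PATTERN.search(line)
    if match is None:
        return None
    message_number = int(match.group(2))
    if message_number == 14:
        # Per jgentry: message 14 is what triggers the preemption track.
        return "preemption"
    if message_number == 13:
        # Per jgentry: message 13 is an unexpected termination, not preemption.
        return "unexpected_termination"
    return "other"
```

Running every failure message from a submission through a function like this would give a quick count of how many failures were genuine preemptions versus unexpected terminations, which is the kind of detail the support team is likely to ask about.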
