preempted job w/ no caching set to failed state

bhaas · Broad Institute · Member, Broadie

Hi, I received the following for a job:

2018-08-24 15:28:18,726 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(43bd5380)ctat_fusion_wf.CTAT_FUSION_TASK_FQPAIRTARGZ:NA:1]: Status change from Initializing to Preempted

and it was given a failure state.

I don't have any preemption settings in my workflow, and I have job caching turned off (because I have everything in one task with an all-or-nothing execution pattern).

Is preemption -> failed state expected here, and is there a better way for me to set this up while keeping costs down?

Answers

  • bshifaw · Member, Broadie, Moderator, admin

    Hi @bhaas

    Adding preemptibles to a workflow is a great way to keep costs down; more on preemptibles can be found on the Preemptibles dictionary page. The preemptible argument normally goes in the runtime attributes block. Can you share that section of your workflow here? Also, are there any messages in the stderr, stdout, or JES log files for that task?
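
    For reference, here's a rough sketch of what a runtime block with preemption enabled could look like (the image and resource values below are only placeholders, not a recommendation for your task):

    runtime {
      docker: "ubuntu:18.04"  # placeholder image
      memory: "4G"
      cpu: "1"
      preemptible: 3  # request a preemptible VM for up to 3 attempts, then fall back to a standard VM
    }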

  • bhaas · Broad Institute · Member, Broadie

    Sure thing. Here's the runtime section:

    runtime {
      docker: "trinityctat/firecloud_ctatfusion:0.0.3"
      disks: "local-disk 200 SSD"
      memory: "50G"
      cpu: "16"
    }

    In the bucket, there's just the script file so it didn't get much further.

    The FireCloud job status page indicates:

    Failures:
    message: Workflow failed
    causedBy:
    message: Task ctat_fusion_wf.CTAT_FUSION_TASK_FQPAIRTARGZ:NA:1 failed. The job was stopped before the command finished. PAPI error code 10. 14: VM ggp-15372092885908461007 stopped unexpectedly.

    and the workflow log indicates:
    Initializing
    2018-08-24 15:28:18,726 INFO - PipelinesApiAsyncBackendJobExecutionActor [UUID(43bd5380)ctat_fusion_wf.CTAT_FUSION_TASK_FQPAIRTARGZ:NA:1]: Status change from Initializing to Preempted
    2018-08-24 15:28:21,361 INFO - $h [UUID(43bd5380)]: Copying workflow logs from /cromwell-workflow-logs/workflow.43bd5380-48bb-4574-922b-6c54440c2236.log to gs://fc-1f65e310-4bf0-4601-8d9a-1715de51a4cb/67313ad1-b6da-48bc-8993-b1bfb7731400/workflow.logs/workflow.43bd5380-48bb-4574-922b-6c54440c2236.log

  • bshifaw · Member, Broadie, Moderator, admin

    Hey @bhaas

    You may want to try rerunning the workflow. According to another forum post, this error may occur on workflows that do not use preemptibles. If you see this error again on the same task, it may also help to use maxRetries (see the sketch at the end of this reply).

    Let us know if this helps. I'll let the dev team know about this error message.
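
    For example, here's a rough sketch of your runtime block with maxRetries added (and preemptible, if you decide to try it); the numbers are only examples to tune, not an official recommendation:

    runtime {
      docker: "trinityctat/firecloud_ctatfusion:0.0.3"
      disks: "local-disk 200 SSD"
      memory: "50G"
      cpu: "16"
      preemptible: 2  # optional: use preemptible VMs for up to 2 attempts before a full-price VM
      maxRetries: 1  # retry the task once more if it fails
    }

    With preemptible set, Cromwell retries a preempted attempt on a fresh preemptible VM up to that many times before switching to a regular VM, so you trade some restart time for the lower preemptible price.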

  • bhaas · Broad Institute · Member, Broadie

    Do you have a recommended update to my runtime section?

    This type of error has been happening to a substantial percentage of my jobs, probably totaling hundreds, but I haven't fully explored it yet. There are a large number that I'll need to rerun later. The jobs are fairly long (hours) and expensive (they require 50G of RAM), and they could run on preemptible VMs to save money, but I need to optimize for cost here: if they get preempted, they basically have to start over.

  • bshifaw · Member, Broadie, Moderator, admin

    I'll check with the dev team for recommendations other than maxRetries and get back to you.

  • bhaas · Broad Institute · Member, Broadie

    Thanks, Jeff. I'll see how many jobs this impacted. I have about 10k jobs, so even low-frequency events should give sufficient examples here. It won't be a huge issue, because running large numbers of jobs isn't going to be a regular thing for me. If I get them done eventually, I'm good. ;-)

  • KateN · Cambridge, MA · Member, Broadie, Moderator, admin

    Thanks for your patience @bhaas. I will make sure that we follow up with Google on the error while we work on updating FireCloud to PAPI v2. In the meantime, I would suggest using maxRetries, as was stated earlier.

  • jgentry · Member, Broadie, Dev ✭✭✭

    @bhaas Cool, thanks. To set expectations, the general answer is likely to be "upgrade to Pipelines API v2" (which FireCloud is in the midst of doing over the next quarter or so), but it'd be good to get a sense of the numbers. Like I said, while this has happened in the past, it's been really uncommon, so it's possible that what's been stumbled upon here is a different issue that looks similar.

  • bhaas · Broad Institute · Member, Broadie

    I had about 2k jobs that failed in my earlier run. I added the flags in my WDL to turn off retries and preemption and threw them back up. About half of these failed, but looking through a couple of them, it might have been for other reasons. It's progress. :smile:
