Continue on SIGTERM code

One of my tasks runs a tool (eggnog-mapper) that kills its own process when it completes successfully; I believe it sends SIGTERM. I want the workflow to continue, since this is normal, successful behaviour. I have tried setting continueOnReturnCode to "true" and to [0, 15], but the workflow keeps aborting and never submits the next task. Am I missing something?

   # The defaults for runtime attributes if not provided.
    default-runtime-attributes {
      failOnStderr: false
      continueOnReturnCode: true
    }

Answers

  • ChrisL (Cambridge, MA; Member, Broadie, Moderator, Dev)

    Hi @Irsan_Kooi could you provide a little more context on what you're trying to do?

    • What the task is (eg could you give a pseudo-bash that demonstrates what your task is trying to do?)
    • What backend you're running on
    • What Cromwell and WDL version you're using
    • What the failure message looks like in the Cromwell logs

    Is 15 the correct exit code here? I believe that the convention is to have an exit code of 128 + signal and since SIGTERM is 15 I would expect the exit code to be 128+15 = 143.
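The 128 + signal convention can be checked locally. A minimal sketch, assuming a POSIX `sh` is available; the child shell here is only a stand-in for a process that receives SIGTERM, not eggnog-mapper itself:

```shell
# Stand-in child: a shell that sends SIGTERM to itself.
child_status=0
sh -c 'kill -TERM $$' || child_status=$?

# Shells report death-by-signal as 128 + the signal number (SIGTERM is 15).
expected_status=$((128 + 15))
echo "child exit status: ${child_status}, expected: ${expected_status}"
```

If the two numbers agree, the value to test for in continueOnReturnCode or in a wrapper is 143 rather than 15.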

  • Irsan_Kooi (Member)

    Using 128 + [0/9/15] does not help

    continueOnReturnCode: [0,137,143]
    

    The task sends a command to annotate protein sequences with eggnog mapper

    # -d euk:   use the eukaryotes database
    # --usemem: load the database into memory
    python2.7 /path/to/eggnog-mapper-master/emapper.py \
            -i /path/to/proteins.fasta \
            --output my_output_prefix \
            -d euk \
            --usemem \
            --cpu 20
    

    I believe that the shutdown behaviour (kill the process, send SIGTERM) is defined in this script, in the shutdown_server() method

    I use a local backend, submit with cromwell run (not server), wdl 0.14, cromwell 31.

    In the relevant task output dir I see the following in stderr.kill

    /path/to/my/execution/script.kill: line 8: kill: (142531) - No such process
    

    and /path/to/my/execution/script.kill contains:

    #!/bin/bash
    kill_children() {
      local pid=$1
      for cpid in $(pgrep -P $pid); do
        kill_children $cpid
      done
      echo killing $pid
      kill $pid
    }
    
    kill_children 142531
    

    The cromwell logs:

    [2018-07-11 08:01:41,60] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: executing: /bin/bash /home/irsan/projects/dev_workflow/cromwell-executions/augustus_eggnog_assembly/692fd6e8-764e-43df-b4c1-edaf6bb79238/call-eggnog_map/execution/script
    [2018-07-11 08:01:45,20] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: job id: 142531
    [2018-07-11 08:01:45,20] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: Status change from - to WaitingForReturnCodeFile
    [2018-07-11 08:08:14,11] [info] Starting coordinated shutdown from JVM shutdown hook
    [2018-07-11 08:08:14,17] [info] Workflow polling stopped
    [2018-07-11 08:08:14,22] [info] Shutting down WorkflowStoreActor - Timeout = 5000 milliseconds
    [2018-07-11 08:08:14,26] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5000 milliseconds
    [2018-07-11 08:08:14,29] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5000 milliseconds
    [2018-07-11 08:08:14,29] [info] Aborting all running workflows.
    [2018-07-11 08:08:14,29] [info] JobExecutionTokenDispenser stopped
    [2018-07-11 08:08:14,31] [info] WorkflowStoreActor stopped
    [2018-07-11 08:08:14,32] [info] WorkflowLogCopyRouter stopped
    [2018-07-11 08:08:14,32] [info] Shutting down WorkflowManagerActor - Timeout = 3600000 milliseconds
    [2018-07-11 08:08:14,32] [info] WorkflowManagerActor Aborting all workflows
    [2018-07-11 08:08:14,33] [info] WorkflowExecutionActor-692fd6e8-764e-43df-b4c1-edaf6bb79238 [692fd6e8]: Aborting workflow
    [2018-07-11 08:08:14,51] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: BackgroundConfigAsyncJobExecutionActor [692fd6e8:augustus_eggnog_assembly.eggnog_map:NA:1] Aborted StandardAsyncJob(142531)
    [2018-07-11 08:08:31,06] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: Status change from WaitingForReturnCodeFile to Done
    [2018-07-11 08:08:31,22] [info] WorkflowExecutionActor-692fd6e8-764e-43df-b4c1-edaf6bb79238 [692fd6e8]: WorkflowExecutionActor [692fd6e8] aborted: augustus_eggnog_assembly.eggnog_map:NA:1
    [2018-07-11 08:08:32,21] [info] WorkflowManagerActor All workflows are aborted
    [2018-07-11 08:08:32,21] [info] WorkflowManagerActor stopped
    [2018-07-11 08:08:32,21] [info] WorkflowManagerActor All workflows finished
    [2018-07-11 08:08:32,21] [info] Connection pools shut down
    [2018-07-11 08:08:37,37] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,37] [info] Shutting down JobStoreActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,37] [warn] Coordinated shutdown phase [service-stop] timed out after 5000 milliseconds
    [2018-07-11 08:08:37,37] [info] Shutting down CallCacheWriteActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,37] [info] SubWorkflowStoreActor stopped
    [2018-07-11 08:08:37,41] [info] JobStoreActor stopped
    [2018-07-11 08:08:37,58] [warn] Couldn't find a suitable DSN, defaulting to a Noop one.
    [2018-07-11 08:08:37,62] [info] Using noop to send events.
    [2018-07-11 08:08:37,69] [info] Shutting down ServiceRegistryActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,69] [info] Shutting down DockerHashActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,69] [info] Shutting down IoProxy - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,69] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
    [2018-07-11 08:08:37,69] [info] CallCacheWriteActor stopped
    [2018-07-11 08:08:37,69] [info] WriteMetadataActor Shutting down: 0 queued messages to process
    [2018-07-11 08:08:37,69] [info] KvWriteActor Shutting down: 0 queued messages to process
    [2018-07-11 08:08:37,70] [info] DockerHashActor stopped
    [2018-07-11 08:08:37,70] [info] IoProxy stopped
    [2018-07-11 08:08:37,70] [info] ServiceRegistryActor stopped
    [2018-07-11 08:08:37,80] [info] Database closed
    [2018-07-11 08:08:37,80] [info] Stream materializer shut down
    [2018-07-11 08:08:37,84] [error] Outgoing request stream error
    akka.stream.AbruptTerminationException: Processor actor [Actor[akka://cromwell-system/user/StreamSupervisor-1/flow-52-0-mergePreferred#559363362]] terminated abruptly
    [2018-07-11 08:08:37,84] [info] Using noop to send events.
    
  • ChrisL (Cambridge, MA; Member, Broadie, Moderator, Dev)

    It looks like the SIGTERM that your process sends to itself is indistinguishable from the SIGTERM that Cromwell sends when it aborts a job.

    Because of that, it looks like Cromwell thinks that the job has been aborted rather than completing normally, and so doesn't even get as far as the "was that a valid return code" check.

    I've made this an issue in our github repository for you to follow: https://github.com/broadinstitute/cromwell/issues/3896

    In the meantime, you might be able to work around this in one of two ways:

    • Change the signal sent to the task. I believe the local backend is hard-coded with a specific exit code to indicate aborted, so anything else should trigger the usual "was it a valid exit code" check.
    • Use the PAPI backend, which detects aborts in a different way so might not trigger this failure.
  • ChrisL (Cambridge, MA; Member, Broadie, Moderator, Dev)

    Another possibility if neither of those appeal - perhaps you add a little catch to the end of your command section to catch the exit code from the python service and turn that into a more Cromwell-friendly exit code for the command as a whole? @kshakir @Ruchi do we have any examples of that sort of thing?

  • kshakir (Broadie, Dev)

    Maybe as a workaround the command {} block could capture the return code using sh, check for 143, and then exit with something else.

    python -c "exit(143)"
    
    EXIT_STATUS=$?
    
    if [ "${EXIT_STATUS}" -eq 143 ]; then
      exit 0
    else
      exit "${EXIT_STATUS}"
    fi
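
The same remapping can be wrapped in a small function, which makes it easy to try outside Cromwell first. This is a sketch only: `run_tolerating_sigterm` is a hypothetical helper name, and the `sh -c` commands are stand-ins for the real tool.

```shell
# Hypothetical helper: run a command and report exit status 143
# (128 + SIGTERM) as success; pass every other status through unchanged.
run_tolerating_sigterm() {
  tool_status=0
  "$@" || tool_status=$?
  if [ "${tool_status}" -eq 143 ]; then
    return 0
  fi
  return "${tool_status}"
}

# Stand-in that terminates itself with SIGTERM: remapped to 0.
remapped_status=0
run_tolerating_sigterm sh -c 'kill -TERM $$' || remapped_status=$?

# Stand-in that fails for an unrelated reason: status 3 is preserved.
other_status=0
run_tolerating_sigterm sh -c 'exit 3' || other_status=$?

echo "remapped: ${remapped_status}, other: ${other_status}"
```

Note that this is plain shell. Inside a draft-2 WDL command section, the `${...}` references and the braces would collide with WDL's own syntax, so it would need adapting there.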
    
  • Irsan_Kooi (Member)
    edited July 16

    I think I am almost there, but the problem is that "${EXIT_STATUS}" is treated as an undefined variable (WDL tries to interpolate ${...}?), so I tried the following, which seems to have no effect:

    task eggnog_map {
            String eggnog_mapper
            String python_v2
            File proteins_fasta
            Int num_cpu
            String database
            # get basename of input file, remove file suffix: anything that follows
            # after the first occurrence of a dot
            String outputLabel = sub(basename(proteins_fasta),"\\..*","")
    
            command {
                    ${python_v2} ${eggnog_mapper} \
                            -i ${proteins_fasta} \
                            --output ${outputLabel}_eggnog \
                            -d ${database} \
                            --usemem \
                            --cpu ${num_cpu}
    
    
        EXIT_STATUS=$?
    
        if [ $EXIT_STATUS -eq 143 ]; then
          exit 0
        else
          exit $EXIT_STATUS
        fi
            }
    
            output {
                    File eggnog_table = "${outputLabel}_eggnog.emapper.annotations"
            }
    }
    
  • ChrisL (Cambridge, MA; Member, Broadie, Moderator, Dev)

    Well, the good news is that in WDL 1.0 the problem with ${} clashing with bash is fixed (by using the command <<< ~{} >>> style placeholders).

    If you're still using draft-2 (which you will be if your WDL file doesn't begin with version 1.0), there's a hack to work around it by bringing in the dollar as a variable (you still need the <<< >>> style to not accidentally catch the } from bash):

    task interpolate_dollar {
      String dollar = "$"
      ...
    
      command <<<
        ...
        if [ "${dollar}{EXIT_STATUS}" -eq 143 ]; then
      >>>
    }
    
    
  • Irsan_Kooi (Member)

    Have you tested that the workaround of catching the SIGTERM exit code works? I think nothing after the call to the python eggnog script is evaluated.
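
One way to narrow this down outside Cromwell, as a sketch: the lines after a command still run if the command only terminates itself, but not if it signals the shell that launched it. Both children below are stand-ins written with `sh -c`; whether emapper.py actually signals its parent or process group is an assumption that is not confirmed in this thread.

```shell
# Case 1: the child sends SIGTERM only to itself.
# The wrapper shell survives and reaches the line after the child.
out_self=$(sh -c 'sh -c "kill -TERM \$\$"; echo "after child: $?"' 2>/dev/null)

# Case 2: the child sends SIGTERM to its parent, the wrapper shell.
# The wrapper dies before the echo, so nothing is printed.
wrapper_status=0
out_parent=$(sh -c 'sh -c "kill -TERM \$PPID"; echo "after child: $?"' 2>/dev/null) || wrapper_status=$?

echo "case 1 output: '${out_self}'"
echo "case 2 output: '${out_parent}' (wrapper exit status: ${wrapper_status})"
```

If case 2 matches what the tool does, an exit-code check placed after it in the same script would never be reached, which would be consistent with the behaviour described above.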
