Update: July 26, 2019
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on GitHub.

Continue on SIGTERM code

One of my tasks submits a process which kills itself when it successfully completes (eggnog-mapper; I believe a SIGTERM is sent). I want the workflow to continue, since this is normal/successful behaviour. I have tried setting continueOnReturnCode to "true" and to [0, 15], but the workflow keeps aborting and does not submit the succeeding task. Am I missing something?

    # The defaults for runtime attributes if not provided.
    default-runtime-attributes {
      failOnStderr: false
      continueOnReturnCode: true
    }
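
For reference, continueOnReturnCode can also be set per task in the WDL runtime section rather than only as a backend default. A minimal sketch (the task name and the exit 143 stand-in are made up for illustration):

    task exits_with_sigterm_code {
      command {
        # Hypothetical stand-in for the real command; exit as if killed by SIGTERM.
        exit 143
      }
      runtime {
        # Accept a clean exit as well as 143 (128 + SIGTERM, where SIGTERM = 15).
        continueOnReturnCode: [0, 143]
      }
    }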

Answers

  • ChrisL (Cambridge, MA) Member, Broadie, Moderator, Dev admin

    Hi @Irsan_Kooi could you provide a little more context on what you're trying to do?

    • What the task is (e.g. could you give a pseudo-bash snippet that demonstrates what your task is trying to do?)
    • What backend you're running on
    • What Cromwell and WDL version you're using
    • What the failure message looks like in the Cromwell logs

    Is 15 the correct exit code here? I believe the convention is to have an exit code of 128 + the signal number, and since SIGTERM is 15, I would expect the exit code to be 128 + 15 = 143.
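
    As a quick sanity check of that convention outside Cromwell (a minimal illustration, not part of the original workflow):

    # The child shell sends SIGTERM (signal 15) to itself; the parent shell
    # then reports exit status 128 + 15 = 143.
    bash -c 'kill -TERM $$'
    echo $?   # prints 143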

  • Irsan_Kooi Member

    Using 128 + [0/9/15] does not help

    continueOnReturnCode: [0,137,143]
    

    The task runs a command that annotates protein sequences with eggnog-mapper:

    # -d euk selects the eukaryotes database; --usemem loads the database into memory.
    python2.7 /path/to/eggnog-mapper-master/emapper.py \
            -i /path/to/proteins.fasta \
            --output my_output_prefix \
            -d euk \
            --usemem \
            --cpu 20
    

    I believe that the shutdown behaviour (killing the process by sending SIGTERM) is defined in this script, in the shutdown_server() method.

    I use a local backend, submit with cromwell run (not server), WDL 0.14, Cromwell 31.

    In the relevant task output dir I see the following in stderr.kill:

    /path/to/my/execution/script.kill: line 8: kill: (142531) - No such process
    

    and /path/to/my/execution/script.kill contains:

    #!/bin/bash
    kill_children() {
      local pid=$1
      for cpid in $(pgrep -P $pid); do
        kill_children $cpid
      done
      echo killing $pid
      kill $pid
    }
    
    kill_children 142531
    

    The Cromwell logs:

    [2018-07-11 08:01:41,60] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: executing: /bin/bash /home/irsan/projects/dev_workflow/cromwell-executions/augustus_eggnog_assembly/692fd6e8-764e-43df-b4c1-edaf6bb79238/call-eggnog_map/execution/script
    [2018-07-11 08:01:45,20] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: job id: 142531
    [2018-07-11 08:01:45,20] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: Status change from - to WaitingForReturnCodeFile
    ^An[2018-07-11 08:08:14,11] [info] Starting coordinated shutdown from JVM shutdown hook
    [2018-07-11 08:08:14,17] [info] Workflow polling stopped
    [2018-07-11 08:08:14,22] [info] Shutting down WorkflowStoreActor - Timeout = 5000 milliseconds
    [2018-07-11 08:08:14,26] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5000 milliseconds
    [2018-07-11 08:08:14,29] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5000 milliseconds
    [2018-07-11 08:08:14,29] [info] Aborting all running workflows.
    [2018-07-11 08:08:14,29] [info] JobExecutionTokenDispenser stopped
    [2018-07-11 08:08:14,31] [info] WorkflowStoreActor stopped
    [2018-07-11 08:08:14,32] [info] WorkflowLogCopyRouter stopped
    [2018-07-11 08:08:14,32] [info] Shutting down WorkflowManagerActor - Timeout = 3600000 milliseconds
    [2018-07-11 08:08:14,32] [info] WorkflowManagerActor Aborting all workflows
    [2018-07-11 08:08:14,33] [info] WorkflowExecutionActor-692fd6e8-764e-43df-b4c1-edaf6bb79238 [692fd6e8]: Aborting workflow
    [2018-07-11 08:08:14,51] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: BackgroundConfigAsyncJobExecutionActor [692fd6e8:augustus_eggnog_assembly.eggnog_map:NA:1] Aborted StandardAsyncJob(142531)
    [2018-07-11 08:08:31,06] [info] BackgroundConfigAsyncJobExecutionActor [692fd6e8augustus_eggnog_assembly.eggnog_map:NA:1]: Status change from WaitingForReturnCodeFile to Done
    [2018-07-11 08:08:31,22] [info] WorkflowExecutionActor-692fd6e8-764e-43df-b4c1-edaf6bb79238 [692fd6e8]: WorkflowExecutionActor [692fd6e8] aborted: augustus_eggnog_assembly.eggnog_map:NA:1
    [2018-07-11 08:08:32,21] [info] WorkflowManagerActor All workflows are aborted
    [2018-07-11 08:08:32,21] [info] WorkflowManagerActor stopped
    [2018-07-11 08:08:32,21] [info] WorkflowManagerActor All workflows finished
    [2018-07-11 08:08:32,21] [info] Connection pools shut down
    [2018-07-11 08:08:37,37] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,37] [info] Shutting down JobStoreActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,37] [warn] Coordinated shutdown phase [service-stop] timed out after 5000 milliseconds
    [2018-07-11 08:08:37,37] [info] Shutting down CallCacheWriteActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,37] [info] SubWorkflowStoreActor stopped
    [2018-07-11 08:08:37,41] [info] JobStoreActor stopped
    [2018-07-11 08:08:37,58] [warn] Couldn't find a suitable DSN, defaulting to a Noop one.
    [2018-07-11 08:08:37,62] [info] Using noop to send events.
    [2018-07-11 08:08:37,69] [info] Shutting down ServiceRegistryActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,69] [info] Shutting down DockerHashActor - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,69] [info] Shutting down IoProxy - Timeout = 1800000 milliseconds
    [2018-07-11 08:08:37,69] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
    [2018-07-11 08:08:37,69] [info] CallCacheWriteActor stopped
    [2018-07-11 08:08:37,69] [info] WriteMetadataActor Shutting down: 0 queued messages to process
    [2018-07-11 08:08:37,69] [info] KvWriteActor Shutting down: 0 queued messages to process
    [2018-07-11 08:08:37,70] [info] DockerHashActor stopped
    [2018-07-11 08:08:37,70] [info] IoProxy stopped
    [2018-07-11 08:08:37,70] [info] ServiceRegistryActor stopped
    [2018-07-11 08:08:37,80] [info] Database closed
    [2018-07-11 08:08:37,80] [info] Stream materializer shut down
    [2018-07-11 08:08:37,84] [error] Outgoing request stream error
    akka.stream.AbruptTerminationException: Processor actor [Actor[akka://cromwell-system/user/StreamSupervisor-1/flow-52-0-mergePreferred#559363362]] terminated abruptly
    [2018-07-11 08:08:37,84] [info] Using noop to send events.
    
  • ChrisL (Cambridge, MA) Member, Broadie, Moderator, Dev admin

    It looks like the SIGTERM that your process sends to itself is indistinguishable from the SIGTERM that Cromwell would send when it aborts a job.

    Because of that, it looks like Cromwell thinks that the job has been aborted rather than completing normally, and so doesn't even get as far as the "was that a valid return code" check.

    I've made this an issue in our github repository for you to follow: https://github.com/broadinstitute/cromwell/issues/3896

    In the meantime, you might be able to work around this in one of two ways:

    • Change the signal sent to the task. I believe the local backend is hard-coded with a specific exit code to indicate aborted, so anything else should trigger the usual "was it a valid exit code" check.
    • Use the PAPI backend, which detects aborts in a different way so might not trigger this failure.
  • ChrisL (Cambridge, MA) Member, Broadie, Moderator, Dev admin

    Another possibility, if neither of those appeals: perhaps you could add a little catch to the end of your command section to capture the exit code from the python service and turn it into a more Cromwell-friendly exit code for the command as a whole? @kshakir @Ruchi do we have any examples of that sort of thing?

  • kshakir Broadie, Dev ✭✭

    Maybe, as a workaround, the command {} block could capture the return code in the shell, check for 143, and then exit with something else:

    # Simulate the failing command: it exits with status 143 (128 + SIGTERM).
    python -c "exit(143)"
    
    EXIT_STATUS=$?
    
    if [ "${EXIT_STATUS}" -eq 143 ]; then
      exit 0
    else
      exit "${EXIT_STATUS}"
    fi
    
  • Irsan_Kooi Member
    edited July 2018

    I think I am almost there, but the problem is that "${EXIT_STATUS}" is treated as an undefined variable (WDL tries to interpolate ${...}?), so I have tried the following, which seems to have no effect:

    task eggnog_map {
            String eggnog_mapper
            String python_v2
            File proteins_fasta
            Int num_cpu
            String database
            # get basename of input file, remove file suffix: anything that follows
            # after the first occurrence of a dot
            String outputLabel = sub(basename(proteins_fasta),"\\..*","")
    
            command {
                    ${python_v2} ${eggnog_mapper} \
                            -i ${proteins_fasta} \
                            --output ${outputLabel}_eggnog \
                            -d ${database} \
                            --usemem \
                            --cpu ${num_cpu}
    
    
        EXIT_STATUS=$?
    
        if [ $EXIT_STATUS -eq 143 ]; then
          exit 0
        else
          exit $EXIT_STATUS
        fi
            }
    
            output {
                    File eggnog_table = "${outputLabel}_eggnog.emapper.annotations"
            }
    }
    
  • ChrisL (Cambridge, MA) Member, Broadie, Moderator, Dev admin

    Well, the good news is that in WDL 1.0 the problem of ${} clashing with bash is fixed (by using a command <<< ... >>> block with ~{} style placeholders).

    If you're still using draft-2 (which you will be if your WDL file doesn't begin with version 1.0), there's a hack to work around it by bringing in the dollar as a variable (you still need the <<< >>> style to not accidentally catch the } from bash):

    task interpolate_dollar {
      String dollar = "$"
      ...
    
      command <<<
        ...
        if [ "${dollar}{EXIT_STATUS}" -eq 143 ]; then
      >>>
    }
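
    Putting the pieces together, a hedged sketch of how the task above might look in WDL 1.0 (untested; it combines the ~{} placeholders with the exit-status remap and a per-task continueOnReturnCode, and whether the remap is ever reached is a separate question):

    version 1.0

    task eggnog_map {
      input {
        String eggnog_mapper
        String python_v2
        File proteins_fasta
        Int num_cpu
        String database
      }
      # Basename of the input file with everything after the first dot removed.
      String outputLabel = sub(basename(proteins_fasta), "\\..*", "")

      command <<<
        ~{python_v2} ~{eggnog_mapper} \
          -i ~{proteins_fasta} \
          --output ~{outputLabel}_eggnog \
          -d ~{database} \
          --usemem \
          --cpu ~{num_cpu}

        # With ~{} placeholders, bash ${...} no longer clashes with WDL, so the
        # exit-status remap can be written directly.
        EXIT_STATUS=$?
        if [ "${EXIT_STATUS}" -eq 143 ]; then
          exit 0
        else
          exit "${EXIT_STATUS}"
        fi
      >>>

      output {
        File eggnog_table = "~{outputLabel}_eggnog.emapper.annotations"
      }

      runtime {
        continueOnReturnCode: [0, 143]
      }
    }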
    
    
  • Have you tested that the workaround where you try to catch the signal actually works? I think anything after the call to the python eggnog script is not evaluated.
