We've moved!
For WDL questions, see the WDL specification and WDL docs.
For Cromwell questions, see the Cromwell docs and please post any issues on Github.

Race condition for simple script causing a job to run forever?

mmah Member, Broadie ✭✭

I have a very simple python script that parses an Illumina filename for a lane identifier and writes this to stdout.

import sys
import re

# Take a command line argument that is an Illumina machine output filename
# and read the lane name from it.
filename = sys.argv[1]
result = re.search("L([0-9]{3})", filename)
print(result.group(0))

For example, this takes the filename "Undetermined_S0_L001_R1_001.fastq.gz" and outputs "L001".
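A quick standalone sketch (using hypothetical test filenames) shows how the regex picks out the lane token:

```python
import re

# Hypothetical Illumina-style filenames, used only to illustrate the regex.
filenames = [
    "Undetermined_S0_L001_R1_001.fastq.gz",
    "Sample1_S1_L004_R2_001.fastq.gz",
]

for name in filenames:
    match = re.search("L([0-9]{3})", name)
    if match:
        # group(0) is the full "L001"-style token; group(1) is just the digits.
        print(match.group(0))
```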

The WDL task looks like this:

task discover_lane_name_from_filename {
    String filename
    File python_lane_name

    command {
        python3 ${python_lane_name} ${filename}
    }

    output {
        String lane = read_string(stdout())
    }
}

I am calling this as part of a scatter operation, so it runs more than once for different filenames. In my last workflow, this task ran 4 times for different inputs. 3/4 of these completed very quickly. 1/4 continued running for > 90 minutes. I checked the stdout file in the execution directory for the failing job, and it contains the correct output "L004", so the python script is completing successfully, but the job (running on SLURM) never completes.

My best guess is that this is a race condition; Cromwell is not expecting the job to complete so quickly, and is waiting for something to change before declaring the job complete. I understand that spawning new jobs to perform simple operations like this incurs lots of overhead.

How should I alter my workflow so that it runs consistently?

Issue · Github
by Geraldine_VdAuwera


Best Answer


  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Hi @mmah, I'm not convinced the length of the jobs is what matters... What version of Cromwell are you running?

  • mmah Member, Broadie ✭✭

    Cromwell v25.

  • mmah Member, Broadie ✭✭

    The problem appears to be related to the state WaitingForReturnCodeFile. I see jobs enter this state, but not exit:

    [INFO] [04/05/2017 13:14:44.123] [cromwell-system-akka.dispatchers.backend-dispatcher-99] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-612a5dcc-f952-40b3-98be-c62194d3fd91/WorkflowExecutionActor-612a5dcc-f952-40b3-98be-c62194d3fd91/612a5dcc-f952-40b3-98be-c62194d3fd91-EngineJobExecutionActor-ancientDNA_screen.discover_lane_name_from_filename:2:1/612a5dcc-f952-40b3-98be-c62194d3fd91-BackendJobExecutionActor-612a5dcc:ancientDNA_screen.discover_lane_name_from_filename:2:1/DispatchedConfigAsyncJobExecutionActor] DispatchedConfigAsyncJobExecutionActor [UUID(612a5dcc)ancientDNA_screen.discover_lane_name_from_filename:2:1]: Status change from - to WaitingForReturnCodeFile

    For jobs that succeed, the execution directory contains an rc file whose content looks like a return code: 0. For jobs that fail, there is an rc.tmp file instead.

    I don't know what the states are, or how a job transitions between states.

  • mcovarr Cambridge, MA · Member, Broadie, Dev ✭✭

    Hi @mmah, Cromwell waits in WaitingForReturnCodeFile until the rc file appears, but it looks like that never happens here for some reason. Could you please email me any files that look like they were created by Cromwell in this directory? Thanks!
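The wait described here is essentially a poll-until-the-file-appears loop. A minimal Python sketch of that idea (this is not Cromwell's actual implementation; the function name, poll interval, and timeout are invented for illustration):

```python
import os
import time

def wait_for_rc(rc_path, poll_seconds=1.0, timeout_seconds=60.0):
    """Poll until an rc file appears, then read the return code from it.

    Sketch of the wait-for-return-code idea described above; not
    Cromwell's real code. Raises TimeoutError if the file never shows up,
    which corresponds to a job stuck in WaitingForReturnCodeFile.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if os.path.exists(rc_path):
            with open(rc_path) as f:
                return int(f.read().strip())
        time.sleep(poll_seconds)
    raise TimeoutError("rc file never appeared at " + rc_path)
```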

  • mcovarr Cambridge, MA · Member, Broadie, Dev ✭✭

    Hi Matthew

    In the failed shards I see something like the following in the execution/stderr file:

    slurmstepd: error: proctrack_p_wait: Unable to destroy container 12345 in cgroup plugin, giving up after 128 sec

    It looks like something is going wrong where SLURM thinks the job continues to run and yet SLURM is unable to kill it. If SLURM thinks the job is still running then Cromwell will too, which explains the hanging. Unfortunately I don't know any more about SLURM, but for further debugging the script.submit file in the execution directory is what Cromwell actually used to submit the job. Please let us know if there's anything more we can do to help.



  • mmah Member, Broadie ✭✭

    What process writes the rc file? Is this done in the parent process, or in the child process? Where can I find the code that does this?

  • mcovarr Cambridge, MA · Member, Broadie, Dev ✭✭

    execution/script does something like

    echo $? > rc.tmp
    sync
    mv rc.tmp rc
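The rc.tmp-then-rename step is the standard write-then-rename trick: the rename is atomic on POSIX filesystems, so a watcher never observes a half-written rc file. A minimal Python sketch of the same pattern (the function name is invented; `os.fsync` stands in for the script's sync call):

```python
import os

def write_rc_atomically(rc_path, returncode):
    """Write the return code to rc.tmp, flush it to disk, then rename to rc.

    Sketch of the write-then-rename pattern shown above, not Cromwell's
    actual code. Readers polling for rc_path either see no file or the
    complete file, never a partial write.
    """
    tmp_path = rc_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(returncode) + "\n")
        f.flush()
        os.fsync(f.fileno())  # rough per-file analogue of the script's sync
    os.rename(tmp_path, rc_path)  # atomic on POSIX filesystems
```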
  • mmah Member, Broadie ✭✭

    echo $? > rc.tmp
    sync
    mv rc.tmp rc

    sync is a likely candidate for nondeterministic behavior.

  • mcovarr Cambridge, MA · Member, Broadie, Dev ✭✭

    Yeah, our team discussed this today. I'm going to remove it in a 25 hotfix and in 26+.

  • mmah Member, Broadie ✭✭

    I understand that sync is slated for removal in v26. I will wait for the v26 release and recheck my workflow then.
