
Race condition for simple script causing a job to run forever?

mmah Member, Broadie

I have a very simple python script that parses an Illumina filename for a lane identifier and writes this to stdout.

import sys
import re

# take an Illumina machine output filename as a command line argument
# and read the lane name from it
filename = sys.argv[1]
result = re.search("L([0-9]{3})", filename)
print(result.group(0))

For example, this takes the filename "Undetermined_S0_L001_R1_001.fastq.gz" and outputs "L001".

The WDL task looks like this:

task discover_lane_name_from_filename{
    String filename
    File python_lane_name

    command{
        python3 ${python_lane_name} ${filename}
    }
    output{
        String lane = read_string(stdout())
    }
}

I am calling this as part of a scatter operation, so it runs more than once for different filenames. In my last workflow, this task ran 4 times for different inputs. 3/4 of these completed very quickly. 1/4 continued running for > 90 minutes. I checked the stdout file in the execution directory for the failing job, and it contains the correct output "L004", so the python script is completing successfully, but the job (running on SLURM) never completes.
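
For context, the scatter call looks roughly like this (the workflow name matches my real one; the array and variable names here are illustrative):

workflow ancientDNA_screen {
    Array[String] fastq_filenames
    File python_lane_name

    # Run the lane-name task once per FASTQ filename.
    scatter (fastq in fastq_filenames) {
        call discover_lane_name_from_filename {
            input:
                filename = fastq,
                python_lane_name = python_lane_name
        }
    }
}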

My best guess is that this is a race condition; Cromwell is not expecting the job to complete so quickly, and is waiting for something to change before declaring the job complete. I understand that spawning new jobs to perform simple operations like this incurs lots of overhead.

How should I alter my workflow so that it runs consistently?

Issue · Github
by Geraldine_VdAuwera

Issue Number: 1926
State: closed
Closed By: vdauwera

Answers

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie

    Hi @mmah, I'm not convinced the length of the jobs is what matters... What version of Cromwell are you running?

  • mmah Member, Broadie

    Cromwell v25.

  • mmah Member, Broadie

    The problem appears to be related to the state WaitingForReturnCodeFile. I see jobs enter this state but never exit it:

    [INFO] [04/05/2017 13:14:44.123] [cromwell-system-akka.dispatchers.backend-dispatcher-99] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-612a5dcc-f952-40b3-98be-c62194d3fd91/WorkflowExecutionActor-612a5dcc-f952-40b3-98be-c62194d3fd91/612a5dcc-f952-40b3-98be-c62194d3fd91-EngineJobExecutionActor-ancientDNA_screen.discover_lane_name_from_filename:2:1/612a5dcc-f952-40b3-98be-c62194d3fd91-BackendJobExecutionActor-612a5dcc:ancientDNA_screen.discover_lane_name_from_filename:2:1/DispatchedConfigAsyncJobExecutionActor] DispatchedConfigAsyncJobExecutionActor [UUID(612a5dcc)ancientDNA_screen.discover_lane_name_from_filename:2:1]: Status change from - to WaitingForReturnCodeFile

    For jobs that succeed, the execution directory contains an rc file with content that looks like a return code: 0. For jobs that fail, there is only an rc.tmp file.

    I don't know what the states are, or how a job transitions between states.

  • mcovarr Cambridge, MA · Member, Broadie, Dev

    Hi @mmah, Cromwell waits in WaitingForReturnCodeFile until the rc file appears, but it looks like that never happens here for some reason. Could you please email me any files that look like they were created by Cromwell in this directory? Thanks!

  • mcovarr Cambridge, MA · Member, Broadie, Dev

    Hi Matthew,

    In the failed shards I see something like the following in the execution/stderr file:

    slurmstepd: error: *** JOB XXXXX ON YYYYY CANCELLED AT ZZZZZ DUE TO TIME LIMIT ***
    slurmstepd: error: proctrack_p_wait: Unable to destroy container 12345 in cgroup plugin, giving up after 128 sec
    

    It looks like something is going wrong where SLURM thinks the job continues to run and yet SLURM is unable to kill it. If SLURM thinks the job is still running then Cromwell will too, which explains the hanging. Unfortunately I don't know any more about SLURM, but for further debugging the script.submit file in the execution directory is what Cromwell actually used to submit the job. Please let us know if there's anything more we can do to help.

    Thanks

    Miguel

  • mmah Member, Broadie

    What process writes the rc file? Is this done in the parent process, or in the child process? Where can I find the code that does this?

  • mcovarr Cambridge, MA · Member, Broadie, Dev

    execution/script does something like

    your_command
    echo $? > rc.tmp
    mv rc.tmp rc
    
  • mmah Member, Broadie

    The execution/script in my run actually contains:

    your_command
    echo $? > rc.tmp
    sync
    mv rc.tmp rc
    

    sync is a likely candidate for the nondeterministic behavior: it flushes all pending filesystem writes, and on a busy shared filesystem that can block for a long time, delaying the mv that produces the rc file.

  • mcovarr Cambridge, MA · Member, Broadie, Dev

    Yeah, our team discussed this today. I'm going to remove it from the 25 hotfix and from 26 onwards.

  • mmah Member, Broadie

    I understand that sync is slated for removal in v26. I will wait for the v26 release and recheck my workflow then.
