Cromwell 28.2 Consistently Fails after Preempted Scatter

I have been noticing odd behaviour when running scatter jobs with Cromwell 28 on JES.

When the final task before the gather is running inside a scatter block and gets preempted, the next dependent step outside the scatter block starts anyway. At the same time, the second attempt of the preempted task starts, and the entire run fails.

In this scenario I am sorting BAM files in a scatter and then gathering the sorted BAM files in a merge step.

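workflow wf {
  Array[File] input_bams

  # Sort each BAM in its own shard
  scatter (bam in input_bams) {
    call Sort {
      input:
        bam = bam
    }
  }

  # Gather step: should start only after every Sort shard has finished
  call MergeBams {
    input:
      bams = Sort.sorted_bam
  }
}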

When the call to Sort is preempted, the outputs of that task are not resolved and an empty list is passed to the bams input of MergeBams. MergeBams then starts but fails to delocalize its files because it had no input files to begin with.

The relevant excerpts from the metadata are below:

"wf.MergeBamFiles": [{
      "preemptible": true,
      "retryableFailure": false,
      "executionStatus": "Failed",
      "shardIndex": -1,
      "jes": {
      },
      "runtimeAttributes": {
        "preemptible": "4",
        "failOnStderr": "false",
        "bootDiskSizeGb": "10",
        "disks": "local-disk 520 HDD",
        "continueOnReturnCode": "0",
        "cpu": "2",
        "noAddress": "false",
        "memory": "7.5 GB"
      },
      "inputs": {
        "input_bam": [],
        "disk_size": 250,
        "input_bam_index": []
      },

      "returnCode": 1,

      "failures": [{
        "causedBy": [],
        "message": "Task wf.MergeBamFiles:1 failed. JES error code 5.  Message: 10: Failed to delocalize files: failed to copy the following files: ..."
      }],
      "backend": "JES",
      "end": "2017-09-04T18:15:04.438Z",
      "attempt": 1,
      "executionEvents": [{
        "startTime": "2017-09-04T18:12:06.940Z",
        "description": "RequestingExecutionToken",
        "endTime": "2017-09-04T18:12:06.941Z"
      }, {
        "startTime": "2017-09-04T18:12:06.940Z",
        "description": "Pending",
        "endTime": "2017-09-04T18:12:06.940Z"
      }, {
        "startTime": "2017-09-04T18:12:06.941Z",
        "description": "PreparingJob",
        "endTime": "2017-09-04T18:12:06.953Z"
      }, {
        "startTime": "2017-09-04T18:12:06.953Z",
        "description": "RunningJob",
        "endTime": "2017-09-04T18:15:04.190Z"
      }, {
        "startTime": "2017-09-04T18:15:04.190Z",
        "description": "UpdatingJobStore",
        "endTime": "2017-09-04T18:15:04.437Z"
      }],
      "start": "2017-09-04T18:12:06.940Z"
    }],
"wf.Sort": [{
      "preemptible": true,
      "retryableFailure": true,
      "executionStatus": "RetryableFailure",
      "backendStatus": "Failed",
      "shardIndex": 0,
      "inputs": {
        "input_bam": "...."
      },
      "failures": [{
        "causedBy": [],
        "message": "Task wf.Sort:0:1 failed. JES error code 10. Task 682b59c0-3fc1-4092-b57c-6e40ba9ef82e:Sort was preempted for the 1st time. The call will be restarted with another preemptible VM (max preemptible attempts number is 4). Error code 10. Message: 14: ...."
      }],
      "backend": "JES",
      "end": "2017-09-04T18:12:05.451Z",
      "attempt": 1,
      "executionEvents": [{
        "startTime": "2017-09-04T18:10:21.233Z",
        "description": "RunningJob",
        "endTime": "2017-09-04T18:12:05.344Z"
      }, {
        "startTime": "2017-09-04T18:10:20.811Z",
        "description": "Pending",
        "endTime": "2017-09-04T18:10:20.811Z"
      }, {
        "startTime": "2017-09-04T18:10:20.811Z",
        "description": "RequestingExecutionToken",
        "endTime": "2017-09-04T18:10:20.811Z"
      }, {
        "startTime": "2017-09-04T18:10:20.811Z",
        "description": "PreparingJob",
        "endTime": "2017-09-04T18:10:21.233Z"
      }, {
        "startTime": "2017-09-04T18:12:05.344Z",
        "description": "UpdatingJobStore",
        "endTime": "2017-09-04T18:12:05.451Z"
      }],
      "start": "2017-09-04T18:10:20.811Z"
    }, {
      "preemptible": true,
      "executionStatus": "Running",
      "backendStatus": "Running",
      "shardIndex": 0,
      "jes": {},
      "runtimeAttributes": {},

      "inputs": {
        "input_bam": "..."
      },
      "attempt": 2,
      "start": "2017-09-04T18:12:05.920Z"
    }]
},
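
For anyone trying to spot the same pattern in their own runs, here is a minimal Python sketch that scans the per-call metadata for the two symptoms shown above: a failed call that received an empty array input, and a shard whose latest attempt has not finished. The function names are hypothetical, the `calls` dictionary is a hand-trimmed copy of the excerpt above rather than real output, and treating "Done" as the status of a successfully completed call is my assumption.

```python
def failed_with_empty_inputs(calls):
    """Return (call_name, [input names]) for failed attempts that got empty array inputs."""
    hits = []
    for name, attempts in calls.items():
        for attempt in attempts:
            if attempt.get("executionStatus") != "Failed":
                continue
            # An empty list here means the upstream gather resolved to nothing.
            empty = sorted(k for k, v in attempt.get("inputs", {}).items() if v == [])
            if empty:
                hits.append((name, empty))
    return hits


def unfinished_shards(calls):
    """Return (call_name, shard_index, status) where the latest attempt is not 'Done'."""
    pending = []
    for name, attempts in calls.items():
        latest = {}
        for attempt in attempts:
            idx = attempt.get("shardIndex", -1)
            # Keep only the highest-numbered attempt for each shard.
            if idx not in latest or attempt.get("attempt", 1) > latest[idx].get("attempt", 1):
                latest[idx] = attempt
        for idx, attempt in sorted(latest.items()):
            # Assumption: "Done" marks a successfully completed call.
            if attempt.get("executionStatus") != "Done":
                pending.append((name, idx, attempt.get("executionStatus")))
    return pending


# Hand-trimmed copy of the metadata excerpt above (illustrative only).
calls = {
    "wf.MergeBamFiles": [
        {"executionStatus": "Failed", "shardIndex": -1, "attempt": 1,
         "inputs": {"input_bam": [], "disk_size": 250, "input_bam_index": []}},
    ],
    "wf.Sort": [
        {"executionStatus": "RetryableFailure", "shardIndex": 0, "attempt": 1,
         "inputs": {"input_bam": "...."}},
        {"executionStatus": "Running", "shardIndex": 0, "attempt": 2,
         "inputs": {"input_bam": "..."}},
    ],
}

print(failed_with_empty_inputs(calls))
print(unfinished_shards(calls))
```

On the trimmed excerpt this flags wf.MergeBamFiles as having failed with empty `input_bam` and `input_bam_index`, while shard 0 of wf.Sort is still on its second attempt, which matches the ordering problem described above.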
