
Handle benign failure messages more appropriately

birger Member, Broadie, CGA-mod ✭✭✭
edited July 2018 in Feature Requests

I (re-)ran our CGA somatic variant calling pipeline on 402 TCGA THCA pairs. In addition to the expected failures due to congestion in rawls (see
https://gatkforums.broadinstitute.org/firecloud/discussion/11860/rawls-failure-in-10-of-402-workflows-launched-in-single-submission#latest ), four other workflows failed mid-run, all apparently associated with container creation. I reran these four workflows and they all ran to completion successfully. The error messages for these four failed workflows were:

message: Workflow failed
causedBy: 
message: Task Clinical_Workflow.Mutect2_Task:6:1 failed. Job exited without an error, exit code 0. PAPI error code 10. Message: 15: Gsutil failed: Could not capture docker logs: Unable to capture docker logs exit status 1
message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.

message: Workflow failed
causedBy: 
message: Task Clinical_Workflow.Mutect1_Task:5:1 failed. The job was stopped before the command finished. PAPI error code 10. Message: 11: Docker run failed: command failed: docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.37/containers/create: read unix @->/var/run/docker.sock: read: connection reset by peer. See 'docker run --help'. . See logs at gs://fc-ce9e4f8c-2c1f-4d67-94e7-4170daa0c81d/5e9c7d0c-ae1d-4213-9cdb-b4ef91c25f9f/Clinical_Workflow/fcb66940-3deb-4cef-9439-a4bcf800d6d2/call-Mutect1_Task/shard-5/

message: Workflow failed
causedBy: 
message: Task Clinical_Workflow.Mutect1_Task:5:1 failed. The job was stopped before the command finished. PAPI error code 10. Message: 11: Docker run failed: command failed: docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.37/containers/create: read unix @->/var/run/docker.sock: read: connection reset by peer. See 'docker run --help'. . See logs at gs://fc-ce9e4f8c-2c1f-4d67-94e7-4170daa0c81d/5e9c7d0c-ae1d-4213-9cdb-b4ef91c25f9f/Clinical_Workflow/546b7656-1e1d-4547-bc6c-1e9c22dc2526/call-Mutect1_Task/shard-5/
message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.

message: Workflow failed
causedBy: 
    message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.
        (11 copies of the same message)
    message: Task Clinical_Workflow.normalMM_Task:NA:1 failed. The job was stopped before the command finished. PAPI error code 10. Message: 11: Docker run failed: command failed: docker: error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.37/containers/create: read unix @->/var/run/docker.sock: read: connection reset by peer. See 'docker run --help'. . See logs at gs://fc-ce9e4f8c-2c1f-4d67-94e7-4170daa0c81d/5e9c7d0c-ae1d-4213-9cdb-b4ef91c25f9f/Clinical_Workflow/3004d191-8ea2-4a29-b934-4e71ac7f9a42/call-normalMM_Task/
    message: Cromwell server was restarted while this workflow was running. As part of the restart process, Cromwell attempted to reconnect to this job, however it was never started in the first place. This is a benign failure and not the cause of failure for this workflow, it can be safely ignored.
        (8 copies of the same message)

As mentioned, I reran these four failing workflows and they completed with no problem. Given how adamant the system was in telling me that the restarting of Cromwell did not cause the workflow failures, I am apt to believe Cromwell's restart did play a role in them. Regardless, I'd like to understand the source of these intermittent failures.

Here is information on the failures:

Google Project: cloud-resource-miscellaneous
Workspace: CBB_20180405_TCGA_THCA_ControlledAccess_V1-0_DATA
Submission ID: 5e9c7d0c-ae1d-4213-9cdb-b4ef91c25f9f
Workflow IDs: a8b9ae04-52bf-476f-a4cc-bee63f5aa013, fcb66940-3deb-4cef-9439-a4bcf800d6d2, 546b7656-1e1d-4547-bc6c-1e9c22dc2526, 3004d191-8ea2-4a29-b934-4e71ac7f9a42
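
(For reference: on a directly accessible Cromwell server, the failure details behind workflow IDs like those above can be pulled from the metadata endpoint; FireCloud users would instead view them in the Monitor tab. A minimal sketch against Cromwell's documented REST API, assuming a server listening on the default port 8000:)

    # fetch only the failure entries for one of the failed workflow IDs
    curl -s "http://localhost:8000/api/workflows/v1/fcb66940-3deb-4cef-9439-a4bcf800d6d2/metadata?includeKey=failures"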

Comments

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    I'm looping in some of our developers because we are debating the meaning/cause of this error message. I am unable to see the workspace you mentioned, so would you double-check that it is shared with [email protected]? I may not be able to see it simply because I don't have TCGA access, but I wanted to double check this as well.

  • birger Member, Broadie, CGA-mod ✭✭✭

    I just shared the workspace with [email protected]. The workspace is in the dbGaP TCGA authorization domain.

  • abaumann Broad DSDE Member, Broadie ✭✭✭

    PAPI error code 10 looks like the actual cause of the issues (error 10 is a sort of catch-all for Google). I checked with the team and the restart issue should be resolved. I'm also suggesting that if this message is always benign, we shouldn't show it in the Failures section, but perhaps surface it as a warning or only in the workflow logs.

    I think what may have happened here is that workflows in FireCloud are set by default to "fail fast," meaning that upon any real failure the workflow will not continue to progress. So when these other jobs failed for real reasons, the workflow failed, and those benign restart issues just never got picked back up by Cromwell.
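
    For anyone running Cromwell directly rather than through FireCloud, the failure mode can be chosen per submission in the workflow options file; a minimal sketch, using Cromwell's documented workflow_failure_mode option (NoNewCalls is the fail-fast behavior described above; ContinueWhilePossible lets jobs that don't depend on the failed call keep running):

        {
          "workflow_failure_mode": "ContinueWhilePossible"
        }

    The options file is passed alongside the WDL at submission, e.g. via the --options flag of Cromwell's run command.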

  • birger Member, Broadie, CGA-mod ✭✭✭

    If PAPI error code 10 was the actual cause of the job failures, why did I get these failures in the first place? I re-ran the failed workflows with no changes and they ran to completion successfully.

  • lucdh Member ✭✭

    We are running cromwell-36 on Google Cloud (directly, not using FireCloud) and got the PAPI error code 10 message below. As reported elsewhere, the error did not show up again after relaunching the workflow.
    Since this error does not seem to be disruptive, is there any (Cromwell) setting we can use to tolerate it and let the pipeline continue?

    2018-11-21 01:16:09,259 cromwell-system-akka.dispatchers.engine-dispatcher-30 ERROR - WorkflowManagerActor Workflow f292626b-cf36-4ae6-a254-b405d4563cab failed (during ExecutingWorkflowState): java.lang.Exception: Task MyWorkflos.MyTask:11:1 failed. Job exited without an error, exit code 0. PAPI error code 10. 15: Gsutil failed: Could not capture docker logs: failed to acquire logs: exit status 1
        at cromwell.backend.google.pipelines.common.PipelinesApiAsyncBackendJobExecutionActor$.StandardException(PipelinesApiAsyncBackendJobExecutionActor.scala:79)
        at cromwell.backend.google.pipelines.common.PipelinesApiAsyncBackendJobExecutionActor.handleFailedRunStatus$1(PipelinesApiAsyncBackendJobExecutionActor.scala:585)
        at cromwell.backend.google.pipelines.common.PipelinesApiAsyncBackendJobExecutionActor.handleExecutionFailure(PipelinesApiAsyncBackendJobExecutionActor.scala:592)
        at cromwell.backend.google.pipelines.common.PipelinesApiAsyncBackendJobExecutionActor.handleExecutionFailure(PipelinesApiAsyncBackendJobExecutionActor.scala:83)
        at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionResult$3(StandardAsyncExecutionActor.scala:1092)
        at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:303)
        at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
        at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
        at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
        at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    
    2018-11-21 01:16:09,262 cromwell-system-akka.dispatchers.engine-dispatcher-30 INFO  - WorkflowManagerActor WorkflowActor-f292626b-cf36-4ae6-a254-b405d4563cab is in a terminal state: WorkflowFailedState
    
  • Ruchi Member, Broadie, Moderator, Dev admin

    @lucdh Have you tried using https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/#maxretries?
    This runtime attribute retries any failure mode that generates a non-zero return code file.
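
    As an illustration, maxRetries is set per task in the WDL runtime section; a minimal sketch (the task name and command here are hypothetical, only the maxRetries attribute itself is from the Cromwell documentation linked above):

        task MyTask {
          command {
            echo "hello"
          }
          runtime {
            docker: "ubuntu:16.04"
            # hypothetical task; maxRetries asks Cromwell to retry this job
            # up to 3 times if it fails with a non-zero return code
            maxRetries: 3
          }
        }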
