Firecloud jobs stopping unexpectedly - PAPI error code 10. 14

I have submitted a large computation to FireCloud (FC) with ~4000 shards across 4 calls, with every task allowing a few preemptible attempts. I noticed that a number of shards eventually fail with the following message:

message: Task workflowAssembly.qcQualityHuman:160:3 failed. The job was stopped before the command finished.
PAPI error code 10. 14: VM ggp-9876237950776228486 stopped unexpectedly.

I know that this has been reported previously by various users. Did you ever figure out why this happens and how to prevent it? I use call caching, but a very large amount of data gets copied every time I re-run to process the failed shards and eventually aggregate my results.


Answers

  • SChaluvadi Member, Broadie, Moderator admin

    @dplichta I believe this error points to a known issue with Pipelines API v1, where non-preemptible machines fail as if they were preemptible. The bug has been filed with the Pipelines API team, and the v2 API should fix it. FireCloud should be updating to Pipelines API v2 soon, and we have not seen this behavior with v2 in Cromwell testing. As a workaround for now, you can have Cromwell handle the retries directly by setting maxRetries.

    Let me know if you have further questions!
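
    For reference, here is a minimal sketch of what the maxRetries workaround mentioned above might look like in a task's runtime section. The task name is taken from the error message in the question; the input, the command, the docker image, and the numeric values are placeholders assumed for illustration, not the actual workflow.

    ```wdl
    task qcQualityHuman {
      # Hypothetical input; the real task's inputs are not shown in this thread.
      File input_reads

      command {
        # Placeholder command standing in for the real QC step.
        echo "processing ${input_reads}"
      }

      runtime {
        # Assumed image; substitute the one the task actually uses.
        docker: "ubuntu:18.04"
        # Number of preemptible attempts before falling back to a regular VM.
        preemptible: 3
        # Asks Cromwell to retry the task itself if it still fails,
        # e.g. when a non-preemptible VM is stopped unexpectedly.
        maxRetries: 2
      }
    }
    ```

    With a setting like this, a shard that fails should be retried within the same submission rather than requiring a full re-run, though the right retry count depends on how often the failure occurs.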

  • SChaluvadi Member, Broadie, Moderator admin

    @dplichta Just wanted to check in and see if your jobs were still queued and if you have had a chance to restart the submission or have had any success with maxRetries?

  • dplichta Member

    @SChaluvadi, this has worked well so far. On a related note, does the MySQL call caching in FireCloud update to point at the latest run when I re-run? Whenever I need to re-run, for whatever reason, I create copies of my data, which is quite large, so I need to clean up failed runs from time to time. If call caching still points to those old runs, the cached results will be invalid.
