Consistent 503 errors from engine functions
I am using Cromwell 28.2 in production. When I submit a large number of jobs, roughly a quarter to a third of them fail with a 503 Service Unavailable error when the workflow calls the size engine function. The jobs run on Google Cloud with the JES backend. I have set the number of retries on API timeout to 5, but I am not observing any retries for the size function. Instead, the entire workflow fails immediately.
From what I can tell, this version of Cromwell does not retry when an engine function receives a timeout or an error from the Google API. Is this fixed in later versions of Cromwell? Is there a config option that can fix it?
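To make the expectation concrete, here is a minimal Python sketch of the retry behavior I was expecting around the size call. It is not Cromwell's code. TransientApiError, call_with_retries, and the backoff values are hypothetical names and assumptions used only for illustration.

```python
import time


class TransientApiError(Exception):
    """Hypothetical stand-in for a retryable API failure such as HTTP 503."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retries(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on 5xx TransientApiError with exponential backoff.

    Retries at most max_retries times. Non-5xx errors and the final
    failure are re-raised to the caller.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except TransientApiError as err:
            # Give up on non-server errors or once the retry budget is spent.
            if err.status < 500 or attempt >= max_retries:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

With something like this wrapped around the storage lookup, a single 503 would cost a short delay instead of failing the whole workflow.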