Intermittent problems causing large numbers of workflow failures
Yesterday morning I launched the CGA Production WES workflow on two TCGA cohorts: THCA with 402 tumor/normal pairs and LUAD with 267 tumor/normal pairs. In each case, I got a large number of workflow failures, most of which appear to be transient: so far I have only re-run the workflow on selected failed pairs, but in those cases the re-run workflows succeeded. For the THCA cohort, 35% of my workflows failed (three of these workflows are stuck in the submitted state, i.e., no tasks have run). For the LUAD cohort, 22% of my workflows failed. I will re-run the workflows on a pair set containing the pairs that previously failed and report back on the results. Regardless, these failure rates, which appear to be attributable to congestion in the workflow engine, are far too high and are a real problem for us: they prevent us from leveraging the cloud's elastic compute to the degree necessary to conduct our research.