Intermittent problems causing large numbers of workflow failures

birger (Member, Broadie, CGA-mod)

Yesterday morning I launched the CGA Production WES workflow on two TCGA cohorts: THCA, with 402 tumor/normal pairs, and LUAD, with 267 tumor/normal pairs. In each case I got a large number of workflow failures, most of which appear to be transient (I have only re-run the workflow on selected failed pairs, but in those cases the re-run workflows succeeded). For the THCA cohort, 35% of my workflows failed, and three of them are stuck in the Submitted state (i.e., no tasks have run). For the LUAD cohort, 22% of my workflows failed. I will re-run the workflows on a pair set containing the pairs that previously failed and report back on the results. Regardless, these failure rates, which appear to be attributable to congestion in the workflow engine, are far too high. They are a real problem for us, preventing us from leveraging the cloud's elastic compute to the degree necessary for our research.
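
A re-run like this can be scripted against the FireCloud API. Below is a minimal sketch using the FISS client (firecloud.api) that pulls the workflows from a submission, keeps the failed pairs, and writes a pair_set membership TSV for resubmission; the workspace names, submission ID, set name, and JSON field names are illustrative assumptions, not values from this post.

```python
# Sketch: collect the pairs whose workflows failed in a FireCloud submission
# and build a pair_set membership TSV for a re-run. Workspace, submission ID,
# and response field names below are illustrative placeholders.
from firecloud import api as fapi

NAMESPACE = "my-billing-project"                         # hypothetical namespace
WORKSPACE = "TCGA_THCA_ControlledAccess"                 # hypothetical workspace
SUBMISSION_ID = "00000000-0000-0000-0000-000000000000"   # placeholder

# Fetch the submission and keep the entity name of every failed/aborted workflow
resp = fapi.get_submission(NAMESPACE, WORKSPACE, SUBMISSION_ID)
resp.raise_for_status()
failed_pairs = [
    wf["workflowEntity"]["entityName"]
    for wf in resp.json().get("workflows", [])
    if wf.get("status") in ("Failed", "Aborted")
]
print(f"{len(failed_pairs)} failed pairs")

# Write a membership load file defining a pair_set of the failed pairs; this
# TSV can be uploaded to the workspace and used as the entity for the re-run.
with open("rerun_failed_pairs.tsv", "w") as out:
    out.write("membership:pair_set_id\tpair\n")
    for pair in failed_pairs:
        out.write(f"THCA_failed_rerun\t{pair}\n")
```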

Answers

  • KateN (admin, Cambridge, MA; Member, Broadie, Moderator)

    Thank you for reporting this. I've looped in the team to see if we can determine why this was happening and proceed from there. I will let you know when I have more information for you.

  • KateN (admin, Cambridge, MA; Member, Broadie, Moderator)

    Lately, for reasons that are still under investigation, our execution engine has been entering an unresponsive state in which garbage collection consumes over 100% of a CPU. This can cause all sorts of strange and erroneous behavior. Needless to say, our developers are investigating this aggressively.

    There is a hotfix going out to FireCloud this week that aims to address these issues.
