Workflow and task failures

kanderka (Broad) Member, Broadie

Hello! I'm running a WDL on several (1200+) genomes and so far, all of them have failed.

I'm getting different errors that appear to point to similar, or maybe the same, problem(s).

Some examples are:
Unexpected failure or termination of the actor monitoring PairedEndSingleSampleWorkflow.CheckFinalVcfExtension:NA:1

Unexpected failure or termination of the actor monitoring PairedEndSingleSampleWorkflow.ScatterIntervalList:NA:1

Unexpected failure or termination of the actor monitoring PairedEndSingleSampleWorkflow.CreateSequenceGroupingTSV:NA:1

Unexpected failure or termination of the actor monitoring PairedEndSingleSampleWorkflow.SortBam:NA:3

This is again for a customer with whom we have privacy agreements in place, so I'd be happy to share the workspace and details outside of the forum.

Thank you for looking into it!
Kristin

Answers

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hi @kanderka, this looks like Cromwell is misbehaving -- let me check with the on-call engineer. In the meantime, can you give me ballpark times of when you submitted these and when they failed? In case they need to check the system logs at a particular timepoint.

  • kanderka (Broad) Member, Broadie

    Hi @Geraldine_VdAuwera - sure, I submitted these last Friday, 2/23. Since there are so many samples, I figured it would take a while, so I didn't notice the failures until yesterday. Looking through a few of them, it looks like a lot of the failures happened between 2/23 and Monday 2/26. Thanks!

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Ok that's useful, thanks. Can you please share the workspace with [email protected]?

  • kanderka (Broad) Member, Broadie

    yes, all set!

  • mcovarr (Cambridge, MA) Member, Broadie, Dev ✭✭

    Hi Kristin, which workspace is this?

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    @kanderka, we found out that your issue was triggered by a restart of the Cromwell system on the 26th. Normally restarts shouldn't cause submissions to fail, but we've seen some problems related to the timing of operations that take place when the server restarts, leading to failures like the ones you encountered. The engineering team is actively working on improving the system to avoid this problem in the future.

    So in short there is nothing wrong in your workspace and we expect that if you redo the submission, it should just work now.

  • kanderka (Broad) Member, Broadie

    Thanks, @Geraldine_VdAuwera!

    Will FireCloud call cache if samples are still "running" and not "succeeded"? I can't find an easy way to just select my 166 failures out of over 1200 samples, so I'm wondering if I can just launch another analysis and FireCloud will ignore what is already running?

    cc @KateN

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Ah, if you have things still running then it's a bit different -- the system will only recognize what has already been completed, so you could end up with the same workflows running redundantly side by side. For some reason I thought all your workflows had failed, not just a subset. The simplest option would be to wait out what's already running. If that's not an option (e.g. if these are hugely long-running) you could abort the running workflows and relaunch everything. Call-caching will take into account all work that has already been successfully completed, and you won't have to worry about race conditions. It does mean you'll be throwing away the progress of any tasks that are actively running, unfortunately -- that's why waiting it out is my first recommendation.
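
One way to act on this advice without hand-picking the failures is to sort samples by their latest workflow status before relaunching anything. The sketch below is only an illustration in plain Python and does not call any FireCloud or Cromwell API: the status labels, the sample names, and the `plan_relaunch` helper are all assumptions, and the status mapping would need to be exported from the workspace by whatever means is available.

```python
# Hypothetical status labels; the real values depend on what the platform reports.
DONE = {"Succeeded"}
ACTIVE = {"Running", "Submitted"}
RETRY = {"Failed", "Aborted"}


def plan_relaunch(statuses):
    """Group samples by what to do next, given their latest workflow status.

    `statuses` maps a sample name to a status string. Samples that are still
    active go under "wait" so they are not launched a second time; failed or
    aborted samples go under "relaunch"; finished samples go under "done".
    """
    plan = {"relaunch": [], "wait": [], "done": []}
    for sample, status in sorted(statuses.items()):
        if status in RETRY:
            plan["relaunch"].append(sample)
        elif status in ACTIVE:
            plan["wait"].append(sample)
        elif status in DONE:
            plan["done"].append(sample)
        else:
            # Refuse to guess about labels this sketch does not know.
            raise ValueError(f"unrecognized status {status!r} for {sample}")
    return plan


if __name__ == "__main__":
    # Made-up sample names, for illustration only.
    example = {
        "sample_01": "Failed",
        "sample_02": "Running",
        "sample_03": "Succeeded",
        "sample_04": "Aborted",
    }
    for action, samples in plan_relaunch(example).items():
        print(action, samples)
```

Relaunching only the "relaunch" group once the "wait" group has finished follows the first recommendation above; aborting the "wait" group and relaunching everything is the fallback, at the cost of the work those tasks had in progress.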
