Failed workflows due to an Agora failure?

birger Member, Broadie, CGA-mod ✭✭✭

On 2/11 at around 10:30 AM I submitted an analysis that launched 1000 simple, single-task workflows. 179 of these failed immediately with the following error:

ErrorReport(rawls,Unable to query the method repo.,Some(502 Bad Gateway),List(ErrorReport(agora,Ask timed out on [Actor[akka://rawls/user/IO-HTTP#1029185287]] after [60000 ms],None,WrappedArray(),WrappedArray(akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333), akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117), scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601), scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109), scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599), akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467), akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419), akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423), akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375), java.lang.Thread.run(Thread.java:745)),Some(class akka.pattern.AskTimeoutException))),List(),None)
BRCA-A7-A13E (participant) | 02/11/2017 at 10:28:11 AM (a day ago) | Failed

The remaining 821 succeeded. I believe this was a transient condition (with Agora?), but nevertheless wanted to report it.

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi Chet, @birger

    That's right, Agora choked on the load. The team had already been working on a plan to beef it up, and it sounds like they're aiming to have the new scaled-up "Big Agora" go into service by EOD on Tuesday. In the meantime, if you have any other large sets of workflows to submit, you might want to launch them in a few smaller batches.
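A minimal sketch of the batching idea, assuming you have the entity names in a plain Python list. The helper `batched`, the batch size of 250, and the participant names are all hypothetical; how each batch is actually submitted (for example, as a separate entity set) is left out.

```python
from typing import Iterator, List, Sequence


def batched(entities: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `entities`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(entities), batch_size):
        yield list(entities[start:start + batch_size])


# Hypothetical entity names standing in for the 1000 participants.
participants = [f"participant-{i:04d}" for i in range(1000)]

# Four batches of 250; each would be launched as its own submission.
batches = list(batched(participants, 250))
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} entities, first is {batch[0]}")
```

Waiting for one batch to get past the launch step before submitting the next keeps the number of simultaneous method repo lookups down.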

  • Chip 415M 4053 Member, Broadie

    Hi

    On 6 Mar 2017, the workflow "damage1_table_plot_workflow" in workspace broad-firecloud-pcawg/damage ran successfully on 6 samples but failed on 274 samples (228 are still running) in sample_set LUAD-TP (submission id 34f579d6-527c-43a6-885e-277443106834), with messages like this:

    ErrorReport(rawls,Unable to query the method repo.,Some(502 Bad Gateway),List(ErrorReport(agora,Ask timed out on [Actor[akka://rawls/user/IO-HTTP#-207631981]] after [60000 ms],None,WrappedArray(),WrappedArray(akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333), akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117), scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601), scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109), scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599), akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467), akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419), akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423), akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375), java.lang.Thread.run(Thread.java:745)),Some(class akka.pattern.AskTimeoutException))),List(),None)

    or

    ErrorReport(rawls,Unable to query the method repo.,Some(502 Bad Gateway),List(ErrorReport(agora,Status: 503 Service Unavailable Body: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> 503 Service Unavailable Service Unavailable The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later. Apache Server at agora-priv.dsde-prod.broadinstitute.org Port 443 ,Some(503 Service Unavailable),WrappedArray(),List(),Some(class spray.httpx.UnsuccessfulResponseException))),List(),None)

    Does this indicate a failure within my WDL, a hardware failure, an Agora issue, or something else?

    Thanks,

    Chip
    @birger
    @eddieasalinas

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @Chip, @birger, @eddieasalinas

    Agora crashed due to a memory shortage around 6pm and was down for about 10 minutes. The memory provisioning has been fixed and Agora is back up. Apologies for the service interruption!
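For anyone trying to tell this kind of failure apart from a WDL problem, here is a rough sketch that pulls the service names and HTTP status codes out of an error message like the ones quoted in this thread. The function name `summarize_error_report` is hypothetical, the example string is abbreviated, and the regular expressions are only an assumption based on the messages shown here, not a documented format.

```python
import re
from typing import List, Tuple


def summarize_error_report(message: str) -> Tuple[List[str], List[int]]:
    """Return (service names, HTTP status codes) found in a nested ErrorReport string."""
    # Each nested report starts with "ErrorReport(<service>,".
    services = re.findall(r"ErrorReport\((\w+),", message)
    # Status codes appear as "Some(502 Bad Gateway)"; "Some(class ...)" is skipped.
    statuses = [int(code) for code in re.findall(r"Some\((\d{3}) ", message)]
    return services, statuses


# Abbreviated version of the first error quoted in this thread.
example = (
    "ErrorReport(rawls,Unable to query the method repo.,Some(502 Bad Gateway),"
    "List(ErrorReport(agora,Ask timed out after [60000 ms],None)))"
)

services, statuses = summarize_error_report(example)
print(services, statuses)
```

When the innermost service is agora and the status is a 502 or 503, as in the reports above, the failure is more likely on the service side than in the workflow itself.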

  • Chip 415M 4053 Member, Broadie

    Hi Geraldine,

    Thanks for the explanation! If I were to submit the same job on the same pair set, would FireCloud job-avoid the pairs that succeeded and run only the pairs that failed due to Agora? The other option would be to make a pair set of the failed pairs and run only that pair set.

    Chip
    @birger, @eddieasalinas

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I don't think we have job avoidance working yet, so I believe you'll need to rerun specifically the pairs that failed -- @abaumann is that right?
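One way to build the list of pairs to rerun, sketched under the assumption that the per-workflow results have been exported as records with `entity`, `status`, and `message` fields. Those field names, the sample records, and the function `entities_to_rerun` are hypothetical; only the "Unable to query the method repo" text comes from the errors quoted above.

```python
from typing import Dict, Iterable, List

# Text that appears in the Agora-related failures quoted in this thread.
METHOD_REPO_ERROR = "Unable to query the method repo"


def entities_to_rerun(workflows: Iterable[Dict[str, str]]) -> List[str]:
    """Return entities whose workflow failed with the method repo error."""
    return [
        wf["entity"]
        for wf in workflows
        if wf["status"] == "Failed" and METHOD_REPO_ERROR in wf["message"]
    ]


# Hypothetical records; the messages are abbreviated or made up for illustration.
workflows = [
    {"entity": "pair-1", "status": "Succeeded", "message": ""},
    {"entity": "pair-2", "status": "Failed",
     "message": "ErrorReport(rawls,Unable to query the method repo.,Some(502 Bad Gateway),...)"},
    {"entity": "pair-3", "status": "Failed",
     "message": "Task exited with return code 1"},
]

rerun = entities_to_rerun(workflows)
print(rerun)
```

Filtering on the message keeps pairs that failed for other reasons out of the new pair set, so those can be looked at separately.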

  • abaumann Broad DSDE Member, Broadie ✭✭✭

    Yes, call caching (job avoidance) is coming out when we release Cromwell 25, which is going into testing at the moment.
