delay in call-cache hitting?

esalinas · Broad · Member, Broadie ✭✭✭
edited March 2017 in Ask the FireCloud Team

I ran a job. It was the first run for call-caching, so it was designed to be a "miss". Indeed, it was a "miss".

Then, less than five minutes later, I asked Chet to run it to test call-cache hitting, but when he ran it, he got a "miss".

Twenty minutes later I asked Gordon to run the same config, and he did get a "hit".

So does this indicate some delay in call-caching (CC)?

I also tested call-caching with a separate WDL: right after a successful submission finished, I re-submitted it myself and got a "hit". In that case, though, it was me submitting under my own credentials and hoping for a "hit", and the WDL had only two inputs (strings) and no outputs.

In contrast, the WDL that Chet and Gordon ran had many more inputs and about 20 outputs. Is there a delay associated with WDL complexity, or with different users trying to use call-caching?

I wonder if anyone else has had similar experiences or observations.

Issue · GitHub #1927 (closed) · opened by Geraldine_VdAuwera, closed by vdauwera

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    It sounds plausible to me that there may be some lag while outputs get registered to the call cache for complex WDLs. I'll ask the Cromwell team to opine.

  • esalinas · Broad · Member, Broadie ✭✭✭

    @mcovarr

    Chet's WF, which was expected to be a "hit" but was not: 3e2a5fbe-22db-4142-9cb8-8467ba912d3b
    Gordon's WF, which was expected to be a "hit" and was: 2914a8f3-2ca9-4caf-87d4-c6d49b1d36b5
    My WF, which I presumed would have provided the match for Chet's cache lookup: b76da20f-0650-4d6e-86c7-676fcb399f47

    @mcovarr So, for a job with CC enabled and properly set up in the WDL and docker image, if the same job were submitted again but happened to be assigned to a different Cromwell instance that had not run it before, would it still be possible to get a cache "hit"?

    @Geraldine_VdAuwera @mcovarr
    Is there a Cromwell upgrade planned in which one Cromwell instance would query another instance's cache DB before making a JES call, or in which they would share some kind of DB?

    -eddie

    Issue · GitHub #1936 (open) · opened by Geraldine_VdAuwera
  • mcovarr · Cambridge, MA · Member, Broadie, Dev ✭✭

    Thanks Eddie. I'll need assistance from a FireClouder to check which Cromwell instance ran each of those workflow IDs.

    There isn't currently a way to share call-cache information across more than one Cromwell instance, but please enter a ticket for such an enhancement if you feel that would be valuable.

    Thanks

    Miguel

  • gordon123 · Broad · Member, Broadie
    edited April 2017

    @mcovarr So, if 2 jobs are submitted, the current setup will give the second job an erroneous cache miss with a 1-(1/N) chance, where N is the number of Cromwell instances. This does not seem scalable.
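
    As a quick sanity check, here is an illustrative estimate of that miss chance, assuming each submission is assigned to one of N Cromwell instances uniformly at random (a sketch of the reasoning only, not how Rawls actually assigns work):

        import random

        def estimated_miss_chance(n_instances, trials=100000):
            """Estimate the chance that a second submission lands on a
            different Cromwell instance than the first, i.e. an erroneous
            cache miss, under uniform random assignment."""
            misses = 0
            for _ in range(trials):
                first = random.randrange(n_instances)
                second = random.randrange(n_instances)
                if first != second:  # different instance -> no shared call cache
                    misses += 1
            return misses / trials

        # With N = 2 this comes out near 0.5, matching the analytic value 1 - (1/N).
        print(estimated_miss_chance(2))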

  • mcovarr · Cambridge, MA · Member, Broadie, Dev ✭✭

    @gordon123 Yes that's the status quo for Cromwell 25, with N = 2 in FireCloud and N = 1 everywhere else. :smile: If this is highly impactful for you I'd recommend entering an enhancement request in the Cromwell repo so our PO can understand why and prioritize this accordingly. Thanks!

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    @gordon123 Two solutions to this problem have been discussed: 1) enable multiple instances of Cromwell to coordinate their call caches, or 2) unify the Cromwell service in FireCloud to use only a single beefier Cromwell instance. Because the two-Cromwell state is unique to FC and was only intended to be a temporary solution to scaling needs anyway, the team has decided to go with the second option and put all effort toward unifying the Cromwell service.

    As a result, call caching will continue to be subject to this "50/50 chance until both instances have run it" limitation until we have the unified Cromwell service, which is probably not going to be completed this quarter but will remain a top priority until we get there.

  • esalinas · Broad · Member, Broadie ✭✭✭

    @Geraldine_VdAuwera Thanks for the update. @Geraldine_VdAuwera @mcovarr, did any "FireClouder" by chance look at the logs to see which Cromwell(s) ran which workflows, and what the actual reason for the call-cache "miss" was? (per @mcovarr's April 6 post: "Thanks Eddie. I'll require assistance.....")

  • gordon123 · Broad · Member, Broadie

    Another possible solution (discussed offline) was for Rawls to select the Cromwell instance based on a hash of the entity id and possibly other stuff, to give a consistent and mostly-balanced choice of instance. This would be a lighter-weight mitigation.
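
    A minimal sketch of that routing idea, assuming Rawls knows the list of Cromwell base URLs and uses the entity id as the routing key (the names and URLs here are illustrative, not Rawls' actual code):

        import hashlib

        # Hypothetical list of Cromwell instances; in FireCloud today there are two.
        CROMWELL_INSTANCES = [
            "https://cromwell-1.example.org",
            "https://cromwell-2.example.org",
        ]

        def pick_instance(entity_id):
            """Deterministically map an entity id to one Cromwell instance,
            so repeat submissions of the same entity reach the same call cache
            while different entities stay roughly balanced across instances."""
            digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(CROMWELL_INSTANCES)
            return CROMWELL_INSTANCES[index]

        # Both calls return the same URL for the same entity id.
        print(pick_instance("sample_123"))
        print(pick_instance("sample_123"))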

  • mcovarr · Cambridge, MA · Member, Broadie, Dev ✭✭

    Hi @esalinas I checked with the FireCloud team and those workflows ran on Cromwells 2, 1, and 1 respectively, which is consistent with two Cromwells as the cause of the cache miss issue.

  • esalinas · Broad · Member, Broadie ✭✭✭

    @mcovarr thanks for the update!
