Failure to load job from cache error

dannykwells San Francisco Member ✭✭
edited September 2017 in Ask the Cromwell + WDL Team

On a particular part of one of our pipelines, I rewrote the code to handle some edge cases, meaning that in 95% of cases it will return the same value. I re-ran the pipeline on a BAM for which it should (and does) return the same output it already produced, which I believe triggers a call-cache hit that copies a previously generated file. However, upon doing this I got the error below. Is it clear what is going on here? It would be ideal if we could get caching working in these situations, since it would save us from re-running lots of data for which we already have the right output.

Also, I checked: I was able to directly copy "gs://nsclc-all-data/variant_calling/3b21821f-3cdb-425e-95e3-17324c635555/call-realign_bam_tumor/shard-0/attempt-2/AL4602_T1.bam" to my local machine, for example, so it does not appear to be an issue with the file, gsutil, or GCS.
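
For reference, that check was just a direct copy of the cached output with gsutil, roughly:

    gsutil cp gs://nsclc-all-data/variant_calling/3b21821f-3cdb-425e-95e3-17324c635555/call-realign_bam_tumor/shard-0/attempt-2/AL4602_T1.bam .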

Thanks!

[2017-09-12 15:37:32,49] [warn] Unrecognized configuration key(s) for Jes: compute-service-account
[2017-09-12 15:37:32,50] [warn] Couldn't find a suitable DSN, defaulting to a Noop one.
[2017-09-12 15:37:32,50] [info] Using noop to send events.
[INFO] [09/12/2017 15:37:33.109] [cromwell-system-akka.dispatchers.service-dispatcher-9] [akka://cromwell-system/user/SingleWorkflowRunnerActor/ServiceRegistryActor/MetadataService] Metadata summary refreshing every 2 seconds.
[INFO] [09/12/2017 15:37:33.125] [cromwell-system-akka.dispatchers.service-dispatcher-7] [akka://cromwell-system/user/SingleWorkflowRunnerActor/ServiceRegistryActor/MetadataService/WriteMetadataActor] WriteMetadataActor configured to write to the database with batch size 200 and flush rate 5 seconds.
[INFO] [09/12/2017 15:37:33.179] [cromwell-system-akka.dispatchers.engine-dispatcher-25] [akka://cromwell-system/user/SingleWorkflowRunnerActor/CallCacheWriteActor] CallCacheWriteActor configured to write to the database with batch size 100 and flush rate 3 seconds.
[INFO] [09/12/2017 15:37:33.939] [cromwell-system-akka.dispatchers.backend-dispatcher-33] [akka://cromwell-system/user/SingleWorkflowRunnerActor/$d/$a] watching Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor/$d/$a/$a#995621296]
[INFO] [09/12/2017 15:37:33.949] [cromwell-system-akka.dispatchers.engine-dispatcher-5] [akka://cromwell-system/user/SingleWorkflowRunnerActor] SingleWorkflowRunnerActor: Submitting workflow
[INFO] [09/12/2017 15:37:33.997] [cromwell-system-akka.dispatchers.api-dispatcher-34] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowStoreActor/WorkflowStoreSubmitActor] Workflow 9201c633-2a58-4c29-9a93-3997c3c39ebc submitted.
[INFO] [09/12/2017 15:37:33.998] [cromwell-system-akka.dispatchers.engine-dispatcher-24] [akka://cromwell-system/user/SingleWorkflowRunnerActor] SingleWorkflowRunnerActor: Workflow submitted UUID(9201c633-2a58-4c29-9a93-3997c3c39ebc)
[INFO] [09/12/2017 15:37:34.000] [cromwell-system-akka.dispatchers.engine-dispatcher-25] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowStoreActor/WorkflowStoreEngineActor] 1 new workflows fetched
[INFO] [09/12/2017 15:37:34.001] [cromwell-system-akka.dispatchers.engine-dispatcher-5] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor] WorkflowManagerActor Starting workflow UUID(9201c633-2a58-4c29-9a93-3997c3c39ebc)
[INFO] [09/12/2017 15:37:34.007] [cromwell-system-akka.dispatchers.engine-dispatcher-5] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor] WorkflowManagerActor Successfully started WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc
[INFO] [09/12/2017 15:37:34.007] [cromwell-system-akka.dispatchers.engine-dispatcher-5] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor] Retrieved 1 workflows from the WorkflowStoreActor
[2017-09-12 15:37:34,04] [warn] The PEM file format will be deprecated in the upcoming Cromwell version. Please use JSON instead.
[INFO] [09/12/2017 15:37:34.977] [cromwell-system-akka.dispatchers.engine-dispatcher-25] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/MaterializeWorkflowDescriptorActor] MaterializeWorkflowDescriptorActor [UUID(9201c633)]: Call-to-Backend assignments: variant_calling.get_tumor_name -> JES, variant_calling.bqsr_normal -> JES, variant_calling.realign_normal -> JES, variant_calling.bqsr_tumor -> JES, variant_calling.vcf2maf -> JES, variant_calling.dedup_tumor -> JES, variant_calling.g_and_c_tumor -> JES, variant_calling.realign_bam_normal -> JES, variant_calling.g_and_c_normal -> JES, variant_calling.realign_tumor -> JES, variant_calling.get_normal_name -> JES, variant_calling.corealignment -> JES, variant_calling.realign_bam_tumor -> JES, variant_calling.dedup_normal -> JES, variant_calling.call_variants -> JES
[2017-09-12 15:37:35,69] [warn] The PEM file format will be deprecated in the upcoming Cromwell version. Please use JSON instead.
[INFO] [09/12/2017 15:37:39.999] [cromwell-system-akka.dispatchers.engine-dispatcher-5] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc] WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc [UUID(9201c633)]: Starting calls: variant_calling.get_normal_name:0:1, variant_calling.get_tumor_name:0:1
[INFO] [09/12/2017 15:37:52.232] [cromwell-system-akka.dispatchers.engine-dispatcher-58] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc] WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc [UUID(9201c633)]: Starting calls: Collector-get_normal_name, Collector-get_tumor_name, variant_calling.realign_bam_normal:0:1, variant_calling.realign_bam_tumor:0:1
[ERROR] [09/12/2017 15:42:53.680] [cromwell-system-akka.dispatchers.engine-dispatcher-26] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_tumor:0:1] Failed copying cache results for job variant_calling.realign_bam_tumor:0:1, invalidating cache entry.
java.util.concurrent.TimeoutException: The Cache hit copying actor timed out waiting for a response to copy gs://nsclc-all-data/variant_calling/3b21821f-3cdb-425e-95e3-17324c635555/call-realign_bam_tumor/shard-0/attempt-2/AL4602_T1.bam to gs://cromwell-variant-calling-test/variant_calling/9201c633-2a58-4c29-9a93-3997c3c39ebc/call-realign_bam_tumor/shard-0/AL4602_T1.bam
    at cromwell.backend.standard.callcaching.StandardCacheHitCopyingActor.onTimeout(StandardCacheHitCopyingActor.scala:315)
    at cromwell.core.actor.RobustClientHelper$$anonfun$robustReceive$1.applyOrElse(RobustClientHelper.scala:33)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor.aroundReceive(Actor.scala:513)
    at akka.actor.Actor.aroundReceive$(Actor.scala:511)
    at cromwell.backend.standard.callcaching.StandardCacheHitCopyingActor.aroundReceive(StandardCacheHitCopyingActor.scala:110)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
    at akka.actor.ActorCell.invoke(ActorCell.scala:496)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[ERROR] [09/12/2017 15:42:53.685] [cromwell-system-akka.dispatchers.engine-dispatcher-26] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_normal:0:1] Failed copying cache results for job variant_calling.realign_bam_normal:0:1, invalidating cache entry.
java.util.concurrent.TimeoutException: The Cache hit copying actor timed out waiting for a response to copy gs://nsclc-all-data/variant_calling/10ccb631-8d6c-4ed4-991f-6d2af45b7c5a/call-realign_bam_normal/shard-0/attempt-2/AL4602_N1.bam to gs://cromwell-variant-calling-test/variant_calling/9201c633-2a58-4c29-9a93-3997c3c39ebc/call-realign_bam_normal/shard-0/AL4602_N1.bam
    at cromwell.backend.standard.callcaching.StandardCacheHitCopyingActor.onTimeout(StandardCacheHitCopyingActor.scala:315)
    at cromwell.core.actor.RobustClientHelper$$anonfun$robustReceive$1.applyOrElse(RobustClientHelper.scala:33)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor.aroundReceive(Actor.scala:513)
    at akka.actor.Actor.aroundReceive$(Actor.scala:511)
    at cromwell.backend.standard.callcaching.StandardCacheHitCopyingActor.aroundReceive(StandardCacheHitCopyingActor.scala:110)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
    at akka.actor.ActorCell.invoke(ActorCell.scala:496)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[INFO] [09/12/2017 15:42:53.714] [cromwell-system-akka.dispatchers.engine-dispatcher-26] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_normal:0:1] Trying to use another cache hit for job: variant_calling.realign_bam_normal:0:1
[INFO] [09/12/2017 15:42:53.714] [cromwell-system-akka.dispatchers.engine-dispatcher-58] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_tumor:0:1] Trying to use another cache hit for job: variant_calling.realign_bam_tumor:0:1
[INFO] [09/12/2017 15:46:47.208] [cromwell-system-akka.actor.default-dispatcher-44] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_tumor:0:1/9201c633-2a58-4c29-9a93-3997c3c39ebc-BackendCacheHitCopyingActor-9201c633:variant_calling.realign_bam_tumor:0:1-1228] Message [cromwell.core.io.IoSuccess] from Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor/IoActor#-1529879979] to Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_tumor:0:1/9201c633-2a58-4c29-9a93-3997c3c39ebc-BackendCacheHitCopyingActor-9201c633:variant_calling.realign_bam_tumor:0:1-1228#-1088615075] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[ERROR] [09/12/2017 15:47:53.765] [cromwell-system-akka.dispatchers.engine-dispatcher-57] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_tumor:0:1] Failed copying cache results for job variant_calling.realign_bam_tumor:0:1, invalidating cache entry.
java.util.concurrent.TimeoutException: The Cache hit copying actor timed out waiting for a response to copy gs://nsclc-all-data/variant_calling/58e2dfc8-c296-4ee9-8ba1-51d3d4c3bc4e/call-realign_bam_tumor/shard-0/AL4602_T1.bam to gs://cromwell-variant-calling-test/variant_calling/9201c633-2a58-4c29-9a93-3997c3c39ebc/call-realign_bam_tumor/shard-0/AL4602_T1.bam
    at cromwell.backend.standard.callcaching.StandardCacheHitCopyingActor.onTimeout(StandardCacheHitCopyingActor.scala:315)
    at cromwell.core.actor.RobustClientHelper$$anonfun$robustReceive$1.applyOrElse(RobustClientHelper.scala:33)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor.aroundReceive(Actor.scala:513)
    at akka.actor.Actor.aroundReceive$(Actor.scala:511)
    at cromwell.backend.standard.callcaching.StandardCacheHitCopyingActor.aroundReceive(StandardCacheHitCopyingActor.scala:110)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
    at akka.actor.ActorCell.invoke(ActorCell.scala:496)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[ERROR] [09/12/2017 15:47:53.765] [cromwell-system-akka.dispatchers.engine-dispatcher-58] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_normal:0:1] Failed copying cache results for job variant_calling.realign_bam_normal:0:1, invalidating cache entry.
java.util.concurrent.TimeoutException: The Cache hit copying actor timed out waiting for a response to copy gs://nsclc-all-data/variant_calling/58e2dfc8-c296-4ee9-8ba1-51d3d4c3bc4e/call-realign_bam_normal/shard-0/AL4602_N1.bam to gs://cromwell-variant-calling-test/variant_calling/9201c633-2a58-4c29-9a93-3997c3c39ebc/call-realign_bam_normal/shard-0/AL4602_N1.bam
    at cromwell.backend.standard.callcaching.StandardCacheHitCopyingActor.onTimeout(StandardCacheHitCopyingActor.scala:315)
    at cromwell.core.actor.RobustClientHelper$$anonfun$robustReceive$1.applyOrElse(RobustClientHelper.scala:33)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor.aroundReceive(Actor.scala:513)
    at akka.actor.Actor.aroundReceive$(Actor.scala:511)
    at cromwell.backend.standard.callcaching.StandardCacheHitCopyingActor.aroundReceive(StandardCacheHitCopyingActor.scala:110)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
    at akka.actor.ActorCell.invoke(ActorCell.scala:496)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 

Answers

  • danb Member, Broadie ✭✭✭
    edited September 2017

    Can you paste your workflow options file? It looks like there is an issue with the compute-service-account key.

    For reference, here are the docs related to this.

  • dannykwells San Francisco Member ✭✭
    edited September 2017

    I'm not sure what file that is - I have a .conf file:

    "google {
    2
    3 application-name = "cromwell"
    4
    5 auths = [
    6 {
    7 name = "service-account"
    8 scheme = "service_account"
    9 service-account-id = "[email protected]"
    10 pem-file = "/home/dwells/variant-calling-wdl/mykey.pem"
    11 }
    12 ]
    13 }
    14
    15 engine {
    16 filesystems {
    17 gcs {
    18 auth = "service-account"
    19 }
    20 }
    21 }
    backend {
    24 default = "JES"
    25 providers {
    26 JES {
    27 actor-factory = "cromwell.backend.impl.jes.JesBackendLifecycleActorFactory"
    28 config {
    29 // Google project
    30 project = ""
    31 compute-service-account = "default"
    32
    33 // Base bucket for workflow executions
    34 root = "/cromwell-execution"
    35
    36 // Polling for completion backs-off gradually for slower-running jobs.
    37 // This is the maximum polling interval (in seconds):
    38 maximum-polling-interval = 600
    39
    40 // Optional Dockerhub Credentials. Can be used to access private docker images.
    41 dockerhub {
    42 // account = ""
    43 // token = ""
    44 }
    45
    46 genomics {
    47 // A reference to an auth defined in the google stanza at the top. This auth is used to create
    48 // Pipelines and manipulate auth JSONs.
    49 auth = "service-account"
    50 // Endpoint for APIs, no reason to change this unless directed by Google.
    51 endpoint-url = "https://genomics.googleapis.com/"
    52 }
    53
    54 filesystems {
    55 gcs {
    56 // A reference to a potentially different auth for manipulating files via engine functions.
    57 #//auth = "application-default"
    58 auth = "service-account"
    59 }
    60 }
    61 }
    62 }
    63 }
    64 }

    call-caching {
    67 # Allows re-use of existing results for jobs you've already run
    68 # (default: false)
    69 enabled = true
    70
    71 # Whether to invalidate a cache result forever if we cannot reuse them. Disable this if you expect some cache copies
    72 # to fail for external reasons which should not invalidate the cache (e.g. auth differences between users):
    73 # (default: true)
    74 #invalidate-bad-cache-results = true
    75 }
    76
    77 database {
    78 #driver = "slick.driver.MySQLDriver$"
    79 #db {
    80 # driver = "com.mysql.jdbc.Driver"
    81 # url = "jdbc:mysql://host/cromwell?rewriteBatchedStatements=true"
    82 # user = "user"
    83 # password = "pass"
    84 # connectionTimeout = 5000
    85 #}
    86
    87 # driver = "slick.driver.MySQLDriver$"
    88 profile = "slick.jdbc.MySQLProfile$"
    89 db {
    90 driver = "com.mysql.jdbc.Driver"
    91 url = "jdbc:mysql://<>:3306/cromwell?useSSL=false"
    92 user = "<>"
    93 password = "<>"
    94 connectionTimeout = 5000
    95 }

    and an options file:

    {
      "final_workflow_outputs_dir" : "gs://nsclc-all-data/cromwell/results",
      "final_workflow_log_dir" : "gs://nsclc-all-data/cromwell/logs",
      "workflow_failure_mode" : "ContinueWhilePossible"
    }

    Both of these have worked with call caching. However, yesterday a colleague began using a similar file to this one to write his logs to the same table (i.e., with the same values). Maybe that is the problem?
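
    (One thing worth noting, given that the log above shows "Failed copying cache results ... invalidating cache entry": the commented-out invalidate-bad-cache-results setting in that call-caching block controls exactly that behavior. A minimal sketch of keeping cache entries valid when a copy attempt fails; whether that is the right call here is an assumption:)

        call-caching {
          enabled = true
          # Don't permanently invalidate a cache entry just because one copy attempt failed
          # (e.g. a transient GCS timeout); the default is true, i.e. invalidate.
          invalidate-bad-cache-results = false
        }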

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hey @dannykwells,

    Your config has project = "", which needs to be updated: you have to declare a Google project in order to use the JES backend.
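
    For example (the project ID and execution bucket below are placeholders), something along these lines:

        backend {
          providers {
            JES {
              config {
                // must point at a real Google Cloud project, not ""
                project = "your-google-project-id"
                // and for JES the execution root should be a GCS bucket path
                root = "gs://your-execution-bucket/cromwell-execution"
              }
            }
          }
        }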

  • dannykwells San Francisco Member ✭✭

    Hi @Ruchi, I think I'm declaring that when I call Cromwell:

    java -Dconfig.file=google-adc.conf -Dbackend.providers.JES.config.project=pici-internal -Dbackend.providers.JES.config.root=gs://cromwell-variant-calling-test -jar ../cromwell-29.jar run variant-calling-on-bams-full.wdl -i variant-calling-test.inputs -o variant_calling_options.json

    I guess it's clear that I don't know which options go where here.
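
    (Side note: the -D flags above are standard Java system-property overrides of the same HOCON keys Cromwell reads from the .conf file, so that command should be equivalent to setting the values directly in the config, e.g.:)

        backend.providers.JES.config.project = "pici-internal"
        backend.providers.JES.config.root = "gs://cromwell-variant-calling-test"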

  • dannykwells San Francisco Member ✭✭

    Hi @Ruchi we are very interested in figuring out how to run in server mode! I read through the docs on this but I couldn't really figure out how to get it going with GCP. Do you have a google doc or something similar that could be shared on this? We are going to be scaling very quickly from 1 user (me) to 10+ in the next few months so this is a key priority to figure out!
    Thank you!

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hey @dannykwells,

    Some docs for running Cromwell in Server mode have been appended to a google doc you have access to and have seen previously.

    In addition, if you think you'll benefit from a meeting to go over different Cromwell configurations and any best practices, please let me know and we can have a meeting to help answer lingering questions. We are very excited to hear you're ramping up on WDL and Cromwell!

  • dannykwells San Francisco Member ✭✭

    Hi @Ruchi

    1. Thanks!
    2. This would be very helpful. One of our engineers has begun looking into server mode to set it up for our team, and I'll reach out to him to see what would be most useful. The docs on Swagger are great and I think will be a great place for us to start. I can reach out through the various channels we have to set up the meeting when we're ready (hopefully within a few weeks).
  • dannykwells San Francisco Member ✭✭

    Hi @Ruchi and @danb, I am encountering this error over and over, meaning that I'm really not able to cache results. I believe the errors driving this are of the form

    [INFO] [09/12/2017 15:46:47.208] [cromwell-system-akka.actor.default-dispatcher-44] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_tumor:0:1/9201c633-2a58-4c29-9a93-3997c3c39ebc-BackendCacheHitCopyingActor-9201c633:variant_calling.realign_bam_tumor:0:1-1228] Message [cromwell.core.io.IoSuccess] from Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor/IoActor#-1529879979] to Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/WorkflowExecutionActor-9201c633-2a58-4c29-9a93-3997c3c39ebc/9201c633-2a58-4c29-9a93-3997c3c39ebc-EngineJobExecutionActor-variant_calling.realign_bam_tumor:0:1/9201c633-2a58-4c29-9a93-3997c3c39ebc-BackendCacheHitCopyingActor-9201c633:variant_calling.realign_bam_tumor:0:1-1228#-1088615075] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

    Do you know if caching works when the job did not result in a "success" because some fraction of the files running through (in this case, about 5% of them) errored out? (I'm using a scatter to run in parallel over 100+ T/N pairs.) That's my current hypothesis of what is going on here, but I don't know. Is there some option you can set to just grab any successful call of a particular task, even if it ultimately wasn't part of a successful run?

    Given a failure of the above, do you think it might work to wipe the SQL database and start over? If so, do you know how I could do that?

    Thanks,
    d
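
    (For what it's worth, wiping the call cache would mean dropping and recreating the Cromwell MySQL database named in the config above; Cromwell manages its schema with Liquibase and should rebuild it on the next startup. A rough sketch, with host and credentials assumed:)

        mysql -h <host> -u <user> -p -e 'DROP DATABASE cromwell; CREATE DATABASE cromwell;'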

  • danb Member, Broadie ✭✭✭

    Hi @dannykwells , try the setting:

    "workflow_failure_mode": "ContinueWhilePossible"
    

    in your workflow options json file.

    This will continue to run the other jobs in the scatter and cache them if successful.

  • dannykwells San Francisco Member ✭✭

    I'm currently doing that:

    Here's my options.json file:
    {
      "final_workflow_outputs_dir" : "<>",
      "final_workflow_log_dir" : "<>",
      "workflow_failure_mode" : "ContinueWhilePossible"
    }

  • dannykwells San Francisco Member ✭✭

    As a follow-up @danb, I have independent confirmation from my colleague that, during a scatter job, if a task fails on one job in the scatter, Cromwell does not cache the results from any of the jobs in that scatter. This is likely the issue. Maybe we didn't set something up correctly?

  • danb Member, Broadie ✭✭✭

    Are you running in server mode? If so, you can use the call caching diff endpoint to debug the calls.

    The "shard index" can be ascertained from the workflow metadata, which is in turn retrieved via the metadata endpoint.

  • dannykwells San Francisco Member ✭✭

    We are not running in server mode - we're building out our front end for it currently, so it will be up in a month or so. Maybe that's the problem? Any tips for how to debug in single-user mode?

  • danb Member, Broadie ✭✭✭

    You can also use the metadata endpoint to verify that the workflow options are properly accounted for (I suspect they are not).

  • ChrisL Cambridge, MA Member, Broadie, Moderator, Dev admin

    @dannykwells just a heads-up: you don't necessarily need a special-purpose front end to test out server mode. Just fire it up and connect to http://<ip>:8000/ in a browser to be redirected to a Swagger page that lets you make and query requests from a basic-but-functional UI.

  • danb Member, Broadie ✭✭✭

    I think you will vastly prefer server mode. Could you try it?

    i.e. java -jar cromwell-X.jar server and hit localhost:8000 in your browser.

    Unfortunately the call cache diff endpoint is only available via the web service...

    If you absolutely can't run via server mode, use the --metadata_output flag to have a look at your metadata and ensure your workflow failure mode is being used.
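
    For example (assuming the flag is spelled -m / --metadata-output in Cromwell 29), something like:

        java -Dconfig.file=google-adc.conf -jar cromwell-29.jar run variant-calling-on-bams-full.wdl \
            -i variant-calling-test.inputs -o variant_calling_options.json -m metadata.json
        # then check metadata.json to confirm the workflow options (e.g. workflow_failure_mode) were picked up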

  • dannykwells San Francisco Member ✭✭

    I realized I could be getting config errors at the start - does this look normal as a startup message to you?

    2017-09-19 20:25:19,081 INFO - Running with database db.url = jdbc:mysql://<>:3306/cromwell?useSSL=false
    2017-09-19 20:25:22,478 INFO - Successfully acquired change log lock
    2017-09-19 20:25:24,055 INFO - Reading from cromwell.DATABASECHANGELOG
    2017-09-19 20:25:24,510 INFO - Successfully released change log lock
    2017-09-19 20:25:24,860 WARN - Unrecognized configuration key(s) for Jes: compute-service-account
    2017-09-19 20:25:24,865 WARN - Couldn't find a suitable DSN, defaulting to a Noop one.
    2017-09-19 20:25:24,870 INFO - Using noop to send events.

  • dannykwells San Francisco Member ✭✭

    Ok, we started server mode (it really is that easy). For people reading this: if you're on Google Cloud, the one extra step is that you have to add a firewall exception for port 8000 for your instance's static IP (and you need a static IP to begin with).

    Caching seems to be working much better with this - at least, Cromwell is finding jobs that I know I've run and using those as the cache. So that's a good thing.

    Is the idea then, that if we're going to be doing call caching we should be in server mode?

  • ChrisL Cambridge, MA Member, Broadie, Moderator, Dev admin

    Server mode has a bunch of advantages, off the top of my head:

    • You can query metadata at any time
    • You can query why your call caches missed
    • More than one person can submit to the same cromwell instance at the same time (i.e. you can share a database but have many workflows running concurrently)
    • Timing diagrams! (eg http://<cromwell_host>:8000/api/workflows/v1/<workflow_id>/timing) - this could be a game changer for you if you're trying to work out what's taking time/resources/cost in your workflows

    One warning - if you've got Cromwell running and the IP address is public, you might want to narrow your firewall down so that only people you know can submit requests to it. You don't want just anybody running jobs as you against your google credit account!
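
    For example, a rough sketch of restricting the rule to a known source range (the rule name, network tag, and CIDR below are placeholders):

        gcloud compute firewall-rules create cromwell-server-8000 \
            --allow tcp:8000 \
            --source-ranges 203.0.113.0/24 \
            --target-tags cromwell-server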

  • dannykwells San Francisco Member ✭✭

    Hi @ChrisL - those are all great! We found the timing diagrams and they're working wonders. One question that came up from looking at the timing: do you find that, on GCP, old Cromwell jobs that have been cancelled linger for hours and keep computing? When I try to abort those jobs, I get the response

    { "status": "error", "message": "Couldn't abort 6f67c7e9-4bc2-40c2-bfea-d54b61e0ad4f because no workflow with that ID is in progress" }
    However, this job does come up when I look at

    GET /api/workflows/{version}/query

    { "id": "6f67c7e9-4bc2-40c2-bfea-d54b61e0ad4f", "name": "variant_calling", "status": "Running", "start": "2017-09-19T20:13:30.218Z" },

    Have you seen it where a job appears as running but can't be shut down? We'd really like to be able to clear these old jobs out!
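
    (For reference, the two calls in question, with host and port assumed:)

        # abort a running workflow
        curl -X POST "http://localhost:8000/api/workflows/v1/6f67c7e9-4bc2-40c2-bfea-d54b61e0ad4f/abort"

        # list workflows the server still considers Running
        curl "http://localhost:8000/api/workflows/v1/query?status=Running"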
