Diagnosing call caching behavior

mmah Member, Broadie

I am using Cromwell v27 running in standalone mode on an LSF backend. I am still finishing development of a workflow, so I am restarting workflows that should have many job results cached, including many scattered job results. I am currently using the following caching configuration:

caching {
              duplication-strategy: [
                "soft-link"
              ]
              hashing-strategy: "path"
              check-sibling-md5: false
}

My observations:
1. I need to allocate significantly more resources to the job running Cromwell when trying to use call caching results. With an empty call cache, Cromwell seems to require ~0.7 CPUs on my backend. With call caching, Cromwell seems to require ~10 CPUs. I am using (reported CPU time / wall time) for this estimation.

2. Retrieved cached results appear in bursts in the filesystem. There may be ~30 minutes between results appearing when reaching a scatter section of the workflow.

3. Average call cache retrieval time seems better for single jobs than for scattered jobs.

4. Call cache retrieval time seems worse for tasks with large inputs, for example, the human reference genome.

I understand the basic concept of call caching: we hash all the inputs, store the hash in a table, and associate the execution directory results with that hash. I do not understand how the caching configuration options affect this process, nor how call cache checking for multiple tasks is done (all parallel? executor pool?).

If hashing paths, not files, why does retrieving scattered alignment results for a large reference take so much longer than retrieving alignment results for a small reference for the same reads?
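
My mental model of the cost difference, sketched as a shell comparison (the reference path is made up and this is not Cromwell's actual hashing code):

    # Hashing only the path string: constant cost, independent of file size.
    echo -n "/refs/GRCh38/genome.fa" | md5sum

    # Hashing the file content: reads the whole file, so a multi-gigabyte
    # reference takes far longer than a small test reference.
    md5sum /refs/GRCh38/genome.fa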

Is Cromwell trying to run all call cache checking in parallel?

Answers

  • mcovarr Cambridge, MA, Member, Broadie, Dev

    Where do you have this config set up? The fact that hashing time seems to be a function of input size makes me suspect this is not in the right place and you're getting the default hashing-strategy: "file". Per the reference.conf, the correct place for this should be within the config for the backend you're using:

    backend {
      default = "Local"
      providers {
        Local {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {
    ...
            filesystems {
              local {
                localization: [
                  "hard-link", "soft-link", "copy"
                ]
    
                caching {
                  # When copying a cached result, what type of file duplication should occur. Attempted in the order listed below:
                  duplication-strategy: [
                    "hard-link", "soft-link", "copy"
                  ]
    
                  # Possible values: file, path
                  # "file" will compute an md5 hash of the file content.
                  # "path" will compute an md5 hash of the file path. This strategy will only be effective if the duplication-strategy (above) is set to "soft-link",
                  # in order to allow for the original file path to be hashed.
                  hashing-strategy: "file"
    
  • mmah Member, Broadie

    I edited the caching section in place:

    backend {
      default = "LSF"
      providers {
        Local {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {
    
            # Limits the number of concurrent jobs
            #concurrent-job-limit = 5
    
            run-in-background = true
            # `script-epilogue` configures a shell command to run after the execution of every command block.
            #
            # If this value is not set explicitly, the default value is `sync`, equivalent to:
            # script-epilogue = "sync"
            #
            # To turn off the default `sync` behavior set this value to an empty string:
            # script-epilogue = ""
    
            runtime-attributes = """
            String? docker
            String? docker_user
            """
            submit = "/bin/bash ${script}"
            submit-docker = """
            docker run \
              --rm -i \
              ${"--user " + docker_user} \
              -v ${cwd}:${docker_cwd} \
              ${docker} \
              /bin/bash ${script}
            """
    
            # Root directory where Cromwell writes job results.  This directory must be
            # visible and writeable by the Cromwell process as well as the jobs that Cromwell
            # launches.
            root = "cromwell-executions"
    
            filesystems {
              local {
                localization: [
                  "hard-link", "soft-link", "copy"
                ]
    
                caching {
                  # When copying a cached result, what type of file duplication should occur. Attempted in the order listed below:
                  duplication-strategy: [
                    "soft-link"
                  ]
    
                  # Possible values: file, path
                  # "file" will compute an md5 hash of the file content.
                  # "path" will compute an md5 hash of the file path. This strategy will only be effective if the duplication-strategy (above) is set to "soft-link",
                  # in order to allow for the original file path to be hashed.
                  hashing-strategy: "path"
    
                  # When true, will check if a sibling file with the same name and the .md5 extension exists, and if it does, use the content of this file as a hash.
                  # If false or the md5 does not exist, will proceed with the above-defined hashing strategy.
                  check-sibling-md5: false
                }
              }
            }
          }
        }
    
        LSF {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {
            concurrent-job-limit = 100
            script-epilogue = ""
            runtime-attributes = """
            Int runtime_minutes = 720
            Int cpus = 2
            Int requested_memory_mb_per_core = 8000
            String queue = "short"
            """
    
            submit = """
            bsub -J ${job_name} -cwd ${cwd} -o ${out} -e ${err} \
            -W ${runtime_minutes} -q ${queue} \
            -n ${cpus} \
            -R "rusage[mem=${requested_memory_mb_per_core}]" \
            /bin/bash ${script}
            """
            kill = "bkill ${job_id}"
            check-alive = "bjobs ${job_id}"
            job-id-regex = "Job <(\\d+)>.*"
          }
        }
    
      }
    }
    
  • mmah Member, Broadie

    OK. That's a silly mistake then. I need to copy the filesystems portion from the Local provider into LSF?

  • mcovarr Cambridge, MA, Member, Broadie, Dev

    Yes, hopefully that should give you a big boost in call caching performance. :smile:
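
    Roughly, the LSF provider would end up with the same filesystems block you already have under Local. This is just a sketch, keeping your existing LSF settings as they are:

    LSF {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        concurrent-job-limit = 100
        script-epilogue = ""

        # ... your existing runtime-attributes, submit, kill, check-alive, job-id-regex ...

        filesystems {
          local {
            localization: [
              "hard-link", "soft-link", "copy"
            ]

            caching {
              # "path" hashing is only effective when cached results are
              # soft-linked, so the original file path is preserved.
              duplication-strategy: [
                "soft-link"
              ]
              hashing-strategy: "path"
              check-sibling-md5: false
            }
          }
        }
      }
    }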
