Call caching

If I understand correctly, call caching works by copying the files from a previous workflow run, provided that the script and inputs are exactly the same. But because Cromwell copies the files, it takes up too much disk space. Is there a way to make it hard-link or soft-link instead?

Answers

  • jsoto, Broad Institute (Member, Broadie, Dev)

    Hey @awacs, there is a section in the Cromwell config file where you can tell Cromwell to soft-link or hard-link a file instead of copying it when a cache hit occurs.

            filesystems {
              local {
                localization: [
                  "hard-link", "soft-link", "copy"
                ]

                caching {
                  # When copying a cached result, what type of file duplication should occur. Attempted in the order listed below:
                  duplication-strategy: [
                    "hard-link", "soft-link", "copy"
                  ]
                }
              }
            }

    The important part is the duplication-strategy setting under caching. It tells Cromwell that for cache hits it should first try to hard-link, then soft-link, then copy. You can find more information here and here.

  • awacs (Member)

    That needs to be specified per filesystem, correct? For example, Local vs. LSF: I have to specify it based on which one I'm using?

  • awacs (Member)

    Can you also explain what hashing-strategy and check-sibling-md5 mean?

  • Ruchi (Member, Broadie, Moderator, Dev, admin)

    @awacs Just to clarify one important point, the filesystems stanza posted above does need to be declared for each backend listed under backend.providers. It can be uniquely configured for each backend, such as local or LSF or SGE. Thanks!
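
    As a rough sketch of that layout, assuming two backends named Local and SGE (the actor-factory line and the remaining SGE settings, such as the submit and kill commands, are assumed to already exist in your config and are left out here; the differing SGE strategy list is only an illustration):

    ```
    backend {
      providers {
        Local {
          # actor-factory and other existing settings stay as they are
          config {
            filesystems {
              local {
                caching {
                  # Tried in order on a cache hit
                  duplication-strategy: ["hard-link", "soft-link", "copy"]
                }
              }
            }
          }
        }
        SGE {
          # actor-factory and other existing settings stay as they are
          config {
            filesystems {
              local {
                caching {
                  # Each backend can use its own list, e.g. no hard links here
                  duplication-strategy: ["soft-link", "copy"]
                }
              }
            }
          }
        }
      }
    }
    ```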

  • I would also like to use call caching, but for an SGE backend. How can I tell whether call caching is working properly? I have copied that filesystems stanza into the SGE backend section. To test this, I submitted a workflow with several tasks and then used qdel once I saw that the second task was running. However, when I submitted the workflow again, Cromwell appeared to submit the first task again, even though I would have expected it to be cached.

  • Ruchi (Member, Broadie, Moderator, Dev, admin)

    @jfiksel Another important requirement for call caching is that Cromwell must be configured to talk to a MySQL database. If that is already the case, can you confirm that you have call caching enabled by searching for this stanza in your config:

    call-caching {
      # Allows re-use of existing results for jobs you've already run
      # (default: false)
      enabled = true
    }
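
    For the database side, a sketch of a MySQL stanza might look like the one below. The host, database name, user, and password are placeholders, and key names and values may differ between Cromwell releases, so check it against the documentation for your version:

    ```
    # Load this file with something like:
    #   java -Dconfig.file=/path/to/your.conf -jar cromwell.jar run ...
    # (paths are hypothetical)
    database {
      profile = "slick.jdbc.MySQLProfile$"
      db {
        driver = "com.mysql.jdbc.Driver"
        # Placeholder host and database name
        url = "jdbc:mysql://localhost/cromwell?rewriteBatchedStatements=true"
        # Placeholder credentials
        user = "cromwell_user"
        password = "change_me"
        connectionTimeout = 5000
      }
    }
    ```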
    

    Thank you!

  • Thanks @Ruchi , I'll try this out. I will have to get in contact with my system administrators about setting up a MySQL database. If/when I do set this up, how exactly does call caching work in cromwell? Will it re-run all tasks where it has not detected a successful run? Or will it only re-run tasks where the task command has changed?

  • Ruchi (Member, Broadie, Moderator, Dev, admin)

    Hey @jfiksel,

    The way call caching works is that if a task has previously succeeded with the same command, inputs, and runtime parameters, Cromwell copies over the output files from the cached run instead of re-running the task. It re-runs the task if the command, inputs, or runtime parameters have changed. More docs found here.

  • shlee, Cambridge (Member, Broadie, Moderator, admin)
    edited February 16

    Hi,

    How can I disable call-caching? Actually, how can I view the config settings of my local cromwell?

    It appears that if I change a workflow's optional parameters then call-caching still happens and my results are incorrect. I am using v4.0.1.1 WDL scripts for Somatic CNV cnv_somatic_pair_workflow.wdl. My original runs used mostly default parameters and generated coverage counts in the default HDF5 format. The run in question adds the following optional parameters to the JSON inputs file. All other parameters are identical.

      "CNVSomaticPairWorkflow.CollectCountsTumor.output_format": "TSV",
      "CNVSomaticPairWorkflow.CollectCountsNormal.output_format": "TSV",
      "CNVSomaticPairWorkflow.PlotModeledSegmentsNormal.minimum_contig_length": "57227415",
      "CNVSomaticPairWorkflow.PlotDenoisedCopyRatiosTumor.minimum_contig_length": "57227415",
      "CNVSomaticPairWorkflow.PlotModeledSegmentsTumor.minimum_contig_length": "57227415",
      "CNVSomaticPairWorkflow.PlotDenoisedCopyRatiosNormal.minimum_contig_length": "57227415"
    

    However, it appears that my results are in HDF5 format and plotting still includes contigs smaller than that specified. I need to redo these runs without call-caching. My current cromwell command structure is as follows:

    java -Dbackend.provider.Local.config.filesystems.gcs.auth=application-default -jar /home/shlee/cromwell-30.2.jar run cnv_somatic_pair_workflow.wdl --inputs cnv_somatic_pair_workflow_ponM.json
    [2018-02-16 20:24:40,95] [info] Running with database db.url = jdbc:hsqldb:mem:a5fdb02c-543a-413e-a354-080ea4f060c0;shutdown=false;hsqldb.tx=mvcc
    ...
    

    Thanks.

  • shlee, Cambridge (Member, Broadie, Moderator, admin)
    edited February 16

    FYI, adding -Dcall-caching.enabled=false to my cromwell run still gives HDF5 results. I am running Cromwell v30.2 on a Google Compute VM, using sudo -i and also tmux.

  • Thib, Cambridge (Member, Broadie, Dev) ✭✭

    Judging by how you're running Cromwell, you are using the default configuration, except for the addition of the gcs filesystem to the local backend.
    Therefore you're using the default in-memory database, which vanishes as soon as the workflow completes.
    This means that your call caching configuration does not matter, as call caching needs a persistent database to work.
    I suspect the issue is with your workflow. You said you added new inputs to your JSON; are those inputs being used in the WDL?

  • shlee, Cambridge (Member, Broadie, Moderator, admin)

    @Thib, I think at one point I played around with a config file and config settings that did enable call-caching. I have at one point used -Dcall-caching.enabled=true and a config file with:

    call-caching {
      # Allows re-use of existing results for jobs you've already run
      # (default: false)
      enabled = true
    }

    As you say, I also assume such settings do not persist across runs.

    The inputs are identical to the previous JSON. Again, I did not change any of the required inputs. I only added some optional parameters that I wanted to change from the default. I am using, unmodified, a 30-page set of workflow WDLs straight from the GATK4 repo.

  • shlee, Cambridge (Member, Broadie, Moderator, admin)

    I will take my question to the GATK4 repo and let you know if I resolve my issue.

  • shlee, Cambridge (Member, Broadie, Moderator, admin)

    I have learned the source of the mismatch from a developer. The WDL I am running calls another WDL (a subworkflow) for its tasks, and these task-level parameters cannot be changed via the JSON inputs file. No warnings are given for the ignored parameters. The version of cnv_somatic_pair_workflow.wdl I am running is v4.0.1.1. In the next version, v4.0.1.2, these parameters have been reconfigured so that they can be changed via the JSON inputs file.

  • ChrisL, Cambridge, MA (Member, Broadie, Moderator, Dev, admin)

    @shlee which Cromwell version are you using? I thought we fixed the "unforwarded subworkflow inputs not getting reported" bug in Cromwell 30, but if not, we should address it.

  • shlee, Cambridge (Member, Broadie, Moderator, admin)

    @ChrisL, I am using Cromwell v30.2.

  • ChrisL, Cambridge, MA (Member, Broadie, Moderator, Dev, admin)

    Thanks @shlee - Does the workflow run to completion ignoring your input values entirely? That does sound like a bug to me - could you raise it with a simple example over in the Cromwell repo?

  • shlee, Cambridge (Member, Broadie, Moderator, admin)
    edited February 23

    Yes, the workflow runs to completion. And sure, I can put in an issue ticket in the Cromwell repository.

    P.S. I've put in the ticket at https://github.com/broadinstitute/cromwell/issues/3316.
