Recommended way of shipping custom scripts with WDL pipelines?

Hi all,

Is there a recommended way to ship a custom script with a WDL pipeline?

I'm writing a WDL workflow with a step that runs R code via Rscript. The CWL version of this pipeline did something like this:

baseCommand: Rscript

requirements:
  - class: InitialWorkDirRequirement
    listing:
      - $(inputs.script)

inputs:
  script:
    type: File
    inputBinding:
      position: 1
    default:
      class: File
      location: MyScript.R

This allows me to ship my R code right next to my .cwl file. Is there something similar in WDL? I would like to avoid hard-coding paths or assuming the script is on the user's $PATH.

Thanks

Answers

  • Redmar_van_den_Berg Member ✭✭

    I would use the docker runtime attribute for this. You can create a Docker image that contains your script and publish it in a public (or private) registry such as Docker Hub. If you specify a docker runtime attribute, Cromwell will automatically pull down the correct image and use it to run the command.
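
    For example, a minimal sketch of what that could look like (the image name is a placeholder for an image you would build and publish yourself):

    task run_rscript {
      File r_script
      command {
        Rscript ${r_script}
      }
      runtime {
        # hypothetical image containing R and the script's package dependencies
        docker: "myorg/rscript-runner:1.0"
      }
    }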

    http://cromwell.readthedocs.io/en/develop/search.html?q=docker

  • CarlosBorroto Member ✭✭

    Hi @Redmar_van_den_Berg, our internal HPC environment doesn't support Docker. I agree, things would be much simpler if it did.

    Are there any other options besides Docker?

  • ChrisL (Cambridge, MA) Member, Broadie, Moderator, Dev admin

    Based on your CWL file, I would translate that as roughly:

    task run_rscript {
      File r_script
      command {
        Rscript ${r_script}
      }
    }
    

    I'm not 100% sure why the InitialWorkDirRequirement was needed in this case, but if it's there to guarantee that the script sits in the working directory at execution time, you can always move it before running Rscript:

    task run_rscript {
      File r_script
      command {
        mv ${r_script} script.R
        Rscript script.R
      }
    }
    
  • ChrisL (Cambridge, MA) Member, Broadie, Moderator, Dev admin

    PS: I just noticed you also specify a default value of "MyScript.R", which would end up looking like this:

    task run_rscript {
      File r_script = ".../path/to/default.R"
      command {
        mv ${r_script} script.R
        Rscript script.R
      }
    }
    

    However... personally I'd strongly encourage you to avoid this kind of pattern! It prevents portability since:

    • It can't possibly work in the cloud.
    • It means that if you share your workflow with me, I just have to know that I need to stage a file at that path.
    • If you submit this workflow to a server instead of running it locally, that server won't have the file available at the same path as you do.

    Therefore, I'd strongly encourage you to get into the habit of making that script file a workflow input and letting the engine work out how to make sure it's available to each task that needs it.
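
    For example, a minimal sketch of that pattern (the workflow name and wiring are illustrative):

    workflow my_workflow {
      File r_script
      call run_rscript { input: r_script = r_script }
    }

    The script is then supplied through the inputs JSON, e.g.:

    {
      "my_workflow.r_script": "/path/to/MyScript.R"
    }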

  • CarlosBorroto Member ✭✭

    Hi @ChrisL,

    The workaround I'm currently using is basically what you are describing: adding the script as an input to my workflow. However, I would argue that is less portable. Now the pipeline user needs to know where the pipeline code will be deployed in order to provide the path to this script. In the CWL code, we are able to say "just look for a file right next to you." Assuming the pipeline is deployed as a whole, I think this pattern is valid. I wonder if "import" could be extended to work with more than just "wdl" files.

    BTW, this pattern is described as best practice for CWL pipelines here:
    "CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value. Use secondaryFiles for scripts that consist of multiple files."
    https://doc.arvados.org/user/cwl/cwl-style.html

    Thanks for the discussion. I'm not surprised this hasn't come up before, as for most people Docker would make this irrelevant.

  • ChrisL (Cambridge, MA) Member, Broadie, Moderator, Dev admin

    Hey @CarlosBorroto, of course you're welcome to do things like that if you want - my last WDL snippet should work just like the CWL version (including a default input). You're right that I would usually advise people to make a Docker image so that the task is as self-contained as possible.

    Having a script as an input (even if it has a default) opens the door to people accidentally submitting an incorrect script and causing all kinds of problems (just because it has a default doesn't mean people can't override it!).

    For a bit of context - a lot of our mindset is based on submitting workflows to a single, shared, hosted Cromwell service using a cloud backend. For us, we would almost always want to host the script in cloud storage and refer to it by URL in the workflow inputs.
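
    For example, the inputs JSON might point at a cloud object (the bucket and path here are made up):

    {
      "my_workflow.r_script": "gs://my-bucket/scripts/MyScript.R"
    }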

    This means that when we share workflows, we often also share "sample inputs" hosted in public locations, so that people can substitute in the inputs they need without having to re-find all of the mundane things like scripts, as well as reference inputs that change very little, if at all, from run to run.

    And relying on the script "just being next to us" doesn't really work for us in a production environment, where the script author, the execution engine running it, and the VM doing the actual processing are all in completely different places. It might be possible to copy the script around between the various actors, but that's a lot of unnecessary file copying - much better to have the script hosted near where the execution is going to happen and only copy it once (from cloud storage down into the job VM, all on the same cloud).
