newbie questions

Hi

I'm just getting started with WDL and I have some questions:

  • What is the preferred input type for directories? Are they String or File types?

  • Is it possible to "import" tasks from another workflow, or some general 'tasks' file, so the same task can be used in multiple workflows?

  • Does WDL support Unix pipes? I know it's possible to use the pipe "|" inside a task, but is there any support for pipes between tasks?

  • What is the preferred approach for commands that are normally piped?
    Is it better to do

    command {
      cmd1 | cmd2 | cmd3
    }

    or to separate the commands into different tasks and let them write to stdout, using that as input for the next task?

  • When using the docker runtime, how can we specify volumes to be mounted? This comes up for pretty much every tool that requires some kind of reference data. For example, when using a BioContainers image for bowtie2, you'd need to mount a directory with reference data for the mapping alongside a directory with the input data.

Thanks a lot.
M


Answers

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hello,

    What is the preferred input type for directories? Are they String or File types?
    is it possible to "import" tasks from another workflow, or some general 'tasks' file, so the same task can be used in multiple workflows?

    There isn't proper directory support yet but one solution that should work is to zip the contents of a directory into a file. Then, you can declare a File type as the task input, and unzip the file in the command block to access all the files in that directory.
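
    In sketch form (task and file names here are just for illustration), the zip-based workaround looks like this:

      task process_dir {
        File zipped_dir   # a zip archive standing in for the directory

        command {
          mkdir contents
          unzip ${zipped_dir} -d contents
          # the directory's files are now available under contents/
          ls contents/ > file_list.txt
        }

        output {
          File file_list = "file_list.txt"
        }
      }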

    Does WDL support Unix pipes? I know it's possible to use the pipe "|" inside a task, but is there any support for pipes between tasks?

    I may have misunderstood your question, but one can connect tasks together by specifying the output of one task as the input of another. Does that help?
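
    For example, assuming two tasks step_one and step_two (hypothetical names), each taking a File input called in and declaring a File output called out, the workflow wires them together like this:

      workflow chained {
        File input_file

        call step_one { input: in = input_file }
        # step_one's output file becomes step_two's input
        call step_two { input: in = step_one.out }
      }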

    What is the preferred approach for commands that are normally piped?

    If cmd1, cmd2 and cmd3 are each individually reusable in other workflows, it may make sense to break them apart. If the resource requirements (memory, CPU, disk size) for each cmd are very different, it's also good to break them apart so you can optimize resource usage for each cmd individually. In addition, if they are all separate tasks and, let's say, cmd3 fails for some transient reason, then with call caching enabled the outputs for cmd1 and cmd2 don't have to be re-computed when the workflow is re-run; they are simply copied over from the first run.
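
    For instance, if cmd2 were the memory-hungry step, only its task has to ask for the bigger machine (the resource values below are made up):

      task run_cmd2 {
        File in

        command {
          cmd2 ${in} > cmd2.out
        }

        output {
          File out = "cmd2.out"
        }

        runtime {
          memory: "16 GB"   # only this step pays for the extra memory
          cpu: 4
        }
      }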

    When using the docker runtime, how can we specify volumes to be mounted? This comes up for pretty much every tool that requires some kind of reference data. For example, when using a BioContainers image for bowtie2, you'd need to mount a directory with reference data for the mapping alongside a directory with the input data.

    You can either unzip the reference bundle to a designated directory, or specify each reference file as a task input and, in the command block, link all reference files under a shared directory.
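
    A rough sketch of the second option, using bowtie2 as the example (the docker tag, index basename and file names are placeholders, and a real bowtie2 index spans several .bt2 files):

      task bowtie2_map {
        Array[File] index_files   # the .bt2 files of one bowtie2 index, e.g. genome.1.bt2, genome.2.bt2, ...
        File reads

        command {
          mkdir ref
          # gather the localized index files under one shared directory
          for f in ${sep=' ' index_files}; do ln -s $f ref/; done
          bowtie2 -x ref/genome -U ${reads} -S output.sam
        }

        output {
          File sam = "output.sam"
        }

        runtime {
          docker: "biocontainers/bowtie2:v2.2.9"   # placeholder tag
        }
      }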

    Hope this helps!

  • matdmset Ghent Member

    Hi!

    Thanks for the reply!

    There isn't proper directory support yet but one solution that should work is to zip the contents of a directory into a file. Then, you can declare a File type as the task input, and unzip the file in the command block to access all the files in that directory.

    I can see this approach working for docker images and directories that can be pulled from S3 or something, but it seems quite inefficient on a shared filesystem, where the directory is already present and readable.

    Does WDL support Unix pipes? I know it's possible to use the pipe "|" inside a task, but is there any support for pipes between tasks?

    I may have misunderstood your question, but one can connect tasks together by specifying the output of one task as the input of another. Does that help?

    I'm aware it's possible to "pipe" the stdout of one task to another, but if I'm not mistaken, this still means intermediate files are written first, whereas when using a pipe, the results just stay in memory without needing any extra storage. I'm trying to find the best way to translate a piped command to WDL, without taking a huge storage hit (which will also be a bottleneck in the process).

    What is the preferred approach for commands that are normally piped?

    If cmd1, cmd2 and cmd3 are each individually reusable in other workflows, it may make sense to break them apart. If the resource requirements (memory, CPU, disk size) for each cmd are very different, it's also good to break them apart so you can optimize resource usage for each cmd individually. In addition, if they are all separate tasks and, let's say, cmd3 fails for some transient reason, then with call caching enabled the outputs for cmd1 and cmd2 don't have to be re-computed when the workflow is re-run; they are simply copied over from the first run.

    Thanks for the tip on call caching; I hadn't spotted that functionality!

    When using the docker runtime, how can we specify volumes to be mounted? This comes up for pretty much every tool that requires some kind of reference data. For example, when using a BioContainers image for bowtie2, you'd need to mount a directory with reference data for the mapping alongside a directory with the input data.

    You can either unzip the reference bundle to a designated directory, or specify each reference file as a task input and, in the command block, link all reference files under a shared directory.

    Is there a resource where I can find some more in-depth info on this? I'd like to study up before asking any more obvious questions.

    Thanks!
    M

  • matdmset Ghent Member

    Alright, thanks for the info!
    Cheers
    M

  • This is my solution for using folders as input with Cromwell: https://gatkforums.broadinstitute.org/wdl/discussion/comment/40701/#Comment_40701

    Since there is no zipping or copying (only hardlinking), it is fast even for large or deeply nested folders.
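
    The core of it is roughly this (a simplified sketch; see the linked post for the full version):

      task use_folder {
        String dir   # path to a folder already present on the shared filesystem

        command {
          # recursively hardlink the folder's contents into the working directory;
          # nothing is copied, so this is fast even for large or deeply nested folders
          cp -rl ${dir} inputs
          ls -R inputs
        }
      }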
