
Setting (extra) workflow options

shbrief · New York · Member
edited February 21 in Ask the FireCloud Team

Hi! It's not clear to me which Cromwell workflow options are available in FC, or how/where to add the options JSON. I'm specifically interested in:

1) (Global) Runtime Attributes
Can I set global attribute options? My workflow currently has multiple tasks using the same docker image, and I plan to expand it with tasks that use different docker images. If I add the same runtime attributes under each task, will they pull the same image repeatedly? (My image is over 4 GB, FYI.) Also, what happens if my workflow has tasks using different runtime environments?

2) Output Copying
I'd like to save outputs in my own GCS bucket. Is it possible? If so, how/where can I define the 'final_workflow_outputs_dir' option?
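
For reference, in stock Cromwell I believe this option goes in a workflow options JSON like the sketch below (the bucket path is a placeholder), but I don't see where to supply such a file in FC:

    {
        "final_workflow_outputs_dir": "gs://my-own-bucket/outputs"
    }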

Thanks!
Sehyun

Answers

  • SChaluvadi · Member, Broadie, Moderator, admin

    @shbrief
    Sorry for the delay! I am looking for some information and will update you soon!

  • SChaluvadi · Member, Broadie, Moderator, admin
    edited February 22

    @shbrief
Were you able to take a look at this documentation for Cromwell Workflow Options? It covers all the different options that you asked about above, as well as examples of how you would implement them in your JSON. Please feel free to correct me if you have already seen the documentation and the information was still unclear; I can find more explanations :)
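
    For the "global" part of your question specifically, those docs describe a default_runtime_attributes block that applies a runtime attribute (such as a docker image) to every task that does not override it. A minimal sketch, with a placeholder image name:

    {
        "default_runtime_attributes": {
            "docker": "us.gcr.io/my-project/my-image:latest"
        }
    }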

  • shbrief · New York · Member

    @SChaluvadi
Yes, I checked that document before, but while checking it again I just found a couple of things I'd like to try for the 'output copying' part now. :) However, the runtime attributes part is still not clear...

For example, if I have three tasks in my workflow, where tasks 1 and 3 use Docker_A and task 2 uses Docker_B, will the runtimes of tasks 1 and 3 be the same or independent? (e.g. if I install some tools in task 1, will they be available in task 3 or not?)

  • shbrief · New York · Member
    edited February 22

    @SChaluvadi
So... Docker_A is pulled only once in my workflow, used in both Task 1 and Task 3, and then removed only after the workflow is finished, right?

  • shbrief · New York · Member

    @SChaluvadi
Sorry, one more quick question... If Task 1 and Task 2 use Docker_A and Task 3 uses Docker_B, will Docker_A still be pulled twice, or just once?

  • SChaluvadi · Member, Broadie, Moderator, admin
    edited February 23

    @shbrief
    Each time a Task performs a function, it is independent of the other Tasks in your workflow. The Workflow block is where you can define a variable that you are going to use in multiple Tasks.

So, in your case you can define Docker_A and Docker_B in your Workflow block; this way you can use Docker_A in both Tasks. Any time a Task uses one of the dockers, it will pull the image each time, regardless of task or docker. Docker_A, in your example, will be pulled twice.
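
    A rough sketch of that pattern (the image names here are placeholders):

    task task_1 {
        String docker_image
        command { echo "running task 1" }
        runtime { docker: docker_image }
    }

    task task_2 {
        String docker_image
        command { echo "running task 2" }
        runtime { docker: docker_image }
    }

    task task_3 {
        String docker_image
        command { echo "running task 3" }
        runtime { docker: docker_image }
    }

    workflow my_workflow {
        # Each image name is defined once here and reused across calls,
        # but every call still pulls its image independently on its own VM.
        String docker_a = "us.gcr.io/my-project/docker_a:latest"
        String docker_b = "us.gcr.io/my-project/docker_b:latest"

        call task_1 { input: docker_image = docker_a }
        call task_2 { input: docker_image = docker_b }
        call task_3 { input: docker_image = docker_a }
    }

    And since each call runs in its own fresh container, tools installed in task_1 will not be available in task_3; anything shared has to be baked into the image itself.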

  • shbrief · New York · Member
    edited February 26

    @SChaluvadi
I've researched the Cromwell output options, and I don't think there is a way to provide an options.json in FireCloud; all the output seems to be saved in the GCS bucket linked to the workspace.

So I'm trying to copy the output to a different GCS bucket by adding an additional 'copy' task (https://github.com/broadinstitute/cromwell/issues/1641#issuecomment-345849355). I'm trying this in the google/cloud-sdk:alpine runtime.

Because the output files are embedded in a complex directory hierarchy, I'm wondering whether there is any systematic way to extract the gs:// URLs of the output files.

Please let me know if you have any suggestions or corrections on any part above. Thanks! :)

  • bshifaw · moon · Member, Broadie, Moderator, admin
    edited February 27

    Hi,

    You might be able to create a task in your workflow that runs at the end with a gsutil command to copy your files over.

    task copy_output {
        String output_file1
        String output_directory2
        String preferred_bucket

        command {
            gsutil cp ${output_file1} ${preferred_bucket}
            gsutil cp -r ${output_directory2} ${preferred_bucket}
        }
        runtime {
            # any image that includes gsutil; google/cloud-sdk:alpine works
            docker: "google/cloud-sdk:alpine"
        }
    }
    

The caveat being it would make the workflow unportable, should you attempt to run it, or share it with somebody else, in some other (non-GCS) environment.

  • shbrief · New York · Member
    edited February 27

    @SChaluvadi
    @bshifaw
As I mentioned in my previous comment, adding a 'copying' task to my workflow is what I've been trying, and my question is whether there is any systematic way to extract the gs:// URLs of my outputs.

For example, I know the bucket linked to my workspace (gs://fc-secure-0893eb66-fffa-4cc2-a919-5c803249c3b9). But the actual output I want to copy is embedded at a path like this:
    gs://fc-secure-0893eb66-fffa-4cc2-a919-5c803249c3b9/b87e1b2d-30e0-417f-b092-980e3e21e716/preprocess_bed/716593a7-aa0c-4c57-b8e8-806d420f3d67/call-IntervalFile/whole_exome_agilent_1.1_gcgene.txt. Also, this info is available to me only after the run is done, meaning that I can't hardcode the path in my workflow.

Also, as I mentioned above, I'm using the google/cloud-sdk:alpine docker image to get gsutil. I don't think the gatk docker has gsutil in it. (Correct me if I'm wrong.)

  • shbrief · New York · Member

    @bshifaw
Thanks! Setting the variable as a String works! :)
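
    In case it helps anyone else, the pattern that worked is roughly the sketch below (task and file names are made up). Declaring the copy task's input as a String and passing an upstream task's File output to it means the task receives the resolved gs:// path at runtime, so nothing needs to be hardcoded:

    task make_result {
        command { echo "some output" > result.txt }
        output { File result = "result.txt" }
        runtime { docker: "ubuntu:18.04" }
    }

    task copy_output {
        # Declared as String, so the File passed in arrives as its gs:// URL
        # rather than being localized onto the VM
        String output_file
        String preferred_bucket

        command { gsutil cp ${output_file} ${preferred_bucket} }
        runtime { docker: "google/cloud-sdk:alpine" }
    }

    workflow preprocess {
        String preferred_bucket

        call make_result
        call copy_output {
            input:
                output_file = make_result.result,
                preferred_bucket = preferred_bucket
        }
    }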
