An array of files as an entity attribute in data model

akmanning · United States · Member
edited July 2017 in Ask the FireCloud Team

I didn't see this in the documentation, but I might have missed it.

Is it possible to specify an array of files as a data model entity attribute?

I have a genetics data set with participants and samples. At the sample set level, our genotyping files (VCF files) are split by chromosome, and I want to design a WDL that takes an array of VCF files (one file per chromosome) as input. I would like a row in my sample set to have an attribute called "VCF_file_array" whose value is "[gs://PATH/chr1.vcf, gs://PATH/chr2.vcf]" etc.

I would then like the method configuration to assign the output of a WDL, an array of files, to a new attribute. For example, if I am writing a conversion script from VCF to GDS files, the output would go to a "GDS_file_array" attribute with the value "[gs://PATH/chr1.gds, gs://PATH/chr2.gds]".

Example WDLs and method configurations would be helpful.

Thanks,
Alisa

Answers

  • KateN · Cambridge, MA · Member, Broadie, Moderator admin

    Hello Alisa,

    Unfortunately, we don't yet have a great way to add arrays to your data model as an entity attribute. Right now we recommend that you write a WDL whose output is that array of files you want. Then in your method config, you can assign that output to the entity attribute you want.

    There is also a way to do it through the Orchestration API, but you can only update one entity at a time (using the update_entity endpoint), and the syntax for it is rather complex. We have a ticket in our queue to write documentation for using this properly, but right now I have nothing to offer you on this front.

    In the future we have plans to make this more UI-friendly, like we allow you to do with workspace attributes (just type it in) or potentially allow you to upload a JSON-formatted file that will populate your entity attribute with the Array[File] you want. Unfortunately, something like this won't be coming until potentially 2018. In the meantime, your best option would be to write a WDL.

  • akmanning · United States · Member
    edited August 2017

    Hi,
    Please suggest a better WDL, but this is what I am thinking. I didn't know if I could leave out the command section of the WDL.

    task getarray {
        Array[File] gdsfilesin
    
        command {
            ls -lh ${sep = ' ' gdsfilesin}
        }
    
        output {
            Array[File] gdsfilesout = gdsfilesin
        }
    
        runtime {
            docker: "[email protected]:84c334414e2bfdcae99509a6add166bbb4fa4041dc3fa6af08046a66fed3005f"
        }
    
    }
    
    workflow gdsfiles {
    
        call getarray
    }
    

    It's kind of ridiculous that this took 15+ minutes to run. Suggestions for tiny docker images and/or other time-savers would be appreciated.

  • KateN · Cambridge, MA · Member, Broadie, Moderator admin
    edited August 2017

    To answer your first question, you can just have a blank command section in your task; that may save you a bit of time.

    To your second question, the smallest docker image I could find is openjdk:alpine. It contains just a Java runtime, and although you don't need Java for your task, the image should be very lightweight.

    Edit: Don't use that docker image, see my comment below for an updated one.

  • Ruchi · Member, Broadie, Moderator, Dev admin
    edited August 2017

    Hey @akmanning,

    If you're just trying to pipe an array of inputs back as an array of outputs, can I recommend you try re-writing your WDL as:

    workflow gdsfiles {
       Array[File] gdsfilesin
    
       output {
          Array[File] gds_files = gdsfilesin
       }
    }
    

    I tested this with a mock WDL on a mock sample, and it took about 1 minute to run.

    workflow wf_output {
      Array[String] in_arr = ["a","b","c"]
    
      output {
        Array[String] out_arr = in_arr
      }
    }
    

    My method config:
    wf_output.out_arr: (Array[String])this.example_attr

    This is what the data model looked like after.

  • KateN · Cambridge, MA · Member, Broadie, Moderator admin
    edited August 2017

    @Ruchi Were you able to run a WDL just like you wrote in FireCloud? I was under the impression that both an empty command block command { } and a docker image runtime { docker: "dockerimage"} were required?

    If a docker image is required, you will need one that has bash installed, rather than the one I mentioned above. You can use this one, which has Java and bash:
    https://hub.docker.com/r/joepjoosten/openjdk-alpine-bash/~/dockerfile/

    Or this one which does not have java, but does have bash:
    https://hub.docker.com/r/frolvlad/alpine-bash/
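
    For reference, the task-based variant discussed so far (an empty command block plus a small image with bash) might look roughly like the sketch below. It reuses the task and workflow names from the earlier post. The image name frolvlad/alpine-bash is an assumption inferred from the Docker Hub link above, and whether an empty command block is accepted is exactly what is being asked in this post, so treat the sketch as untested.

    ```wdl
    task getarray {
        Array[File] gdsfilesin

        # Intentionally empty: the task only passes its inputs through.
        command { }

        output {
            Array[File] gdsfilesout = gdsfilesin
        }

        runtime {
            # Assumption: image name inferred from the Docker Hub URL above.
            docker: "frolvlad/alpine-bash"
        }
    }

    workflow gdsfiles {
        call getarray
    }
    ```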

  • Ruchi · Member, Broadie, Moderator, Dev admin

    Hey Kate,

    I shared the exact method I ran, which is the same workflow as the one posted above under the name wf_output:
    https://portal.firecloud.org/#methods/m/Test-Concepts/pipe_outputs/1

    Submission information:
    https://portal.firecloud.org/#workspaces/broad-firecloud-dsde/RM_Playground/monitor/07f7499a-2623-40ba-9824-b31590955f25/ad9e0ba5-8ab0-40b3-a7c0-ec1349c1f2cd

    This works because Cromwell doesn't need to run a task to create workflow-level outputs. If you want to pass your workflow inputs straight back out as your workflow outputs, it can do that without running any jobs at all.

  • KateN · Cambridge, MA · Member, Broadie, Moderator admin

    Oh, fantastic! I had no idea that trick existed. @akmanning, Ruchi's workflow should minimize overhead for you much more effectively than my earlier suggestions.
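
    Once the array attribute exists, the VCF-to-GDS conversion described in the original question could be sketched along the following lines, in the same WDL style used in this thread. This is only an illustration: the task name, the converter command my_vcf_to_gds_converter, and the docker image are hypothetical placeholders, not real tools, and would need to be replaced with an actual conversion script and image.

    ```wdl
    # Hypothetical conversion task: the tool name and image are placeholders.
    task vcf_to_gds {
        File vcf
        # Derive an output name such as chr1.gds from chr1.vcf.
        String gds_name = basename(vcf, ".vcf") + ".gds"

        command {
            # Placeholder command; substitute your real converter here.
            my_vcf_to_gds_converter ${vcf} ${gds_name}
        }

        output {
            File gds = "${gds_name}"
        }

        runtime {
            # Placeholder image; it must contain bash and the converter.
            docker: "your-registry/your-converter-image"
        }
    }

    workflow vcf_array_to_gds_array {
        Array[File] vcf_files

        # One conversion job per chromosome file.
        scatter (vcf in vcf_files) {
            call vcf_to_gds { input: vcf = vcf }
        }

        output {
            # Outputs of a scattered call are gathered into an array.
            Array[File] gds_files = vcf_to_gds.gds
        }
    }
    ```

    Following the pattern of the method config shown earlier in the thread, the input vcf_array_to_gds_array.vcf_files would presumably be set to this.VCF_file_array and the output vcf_array_to_gds_array.gds_files to this.GDS_file_array, using the attribute names from the original question.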
