An array of files as an entity attribute in data model

akmanning (United States; Member)
edited July 2017 in Ask the FireCloud Team

I didn't see this in the documentation, but I might have missed it.

Is it possible to specify an array of files as a data model entity attribute?

I have a genetics data set, with participants and samples. At the sample set level, our genotyping files (VCF files) are split by chromosome, so I want to design a WDL that takes an array of VCF files (one file per chromosome) as input. I would like a row in my sample set with an attribute called "VCF_file_array" whose value is "[gs://PATH/chr1.vcf, gs://PATH/chr2.vcf]", etc.

I would then like the method configuration to assign the output of a WDL (an array of files) to a new attribute. For example, if I write a conversion script from VCF to GDS files, I would want a "GDS_file_array" attribute with "[gs://PATH/chr1.gds, gs://PATH/chr2.gds]" as its value.

Example WDLs and method configurations would be helpful.

Thanks,
Alisa

Answers

  • KateN (Cambridge, MA; Member, Broadie, Moderator, admin)

    Hello Alisa,

    Unfortunately, we don't yet have a great way to add arrays to your data model as an entity attribute. Right now we recommend that you write a WDL whose output is that array of files you want. Then in your method config, you can assign that output to the entity attribute you want.
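
    For example, here is a rough sketch of the output mapping in the method config, using the attribute name from your question ("myworkflow" and "my_array_output" are placeholders for whatever your WDL actually defines):

    myworkflow.my_array_output: this.GDS_file_array

    The expression you enter for that output is just this.GDS_file_array; when the workflow completes, that attribute on the entity you launched on (your sample set) is populated with the array.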

    There is also a way to do it through the Orchestration API, but you can only update one entity at a time (using the update_entity endpoint), and the syntax for it is rather complex. We have a ticket in our queue to write documentation for using this properly, but right now I have nothing to offer you on this front.
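
    For anyone who wants to experiment before the documentation exists: the request is a PATCH to the entity, with a body that is a list of attribute-update operations, very roughly along these lines (reconstructed from memory, so the endpoint path and field names may differ; treat it as a sketch, not documentation):

    PATCH /api/workspaces/{workspaceNamespace}/{workspaceName}/entities/sample_set/{sampleSetName}

    [
      {"op": "RemoveAttribute", "attributeName": "VCF_file_array"},
      {"op": "CreateAttributeValueList", "attributeName": "VCF_file_array"},
      {"op": "AddListMember", "attributeListName": "VCF_file_array", "newMember": "gs://PATH/chr1.vcf"},
      {"op": "AddListMember", "attributeListName": "VCF_file_array", "newMember": "gs://PATH/chr2.vcf"}
    ]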

    In the future, we have plans to make this more UI-friendly, like we already allow for workspace attributes (just type it in), or potentially to let you upload a JSON-formatted file that will populate your entity attribute with the Array[File] you want. Unfortunately, something like this won't be coming until 2018 at the earliest. In the meantime, your best option is to write a WDL.

  • akmanning (United States; Member)
    edited August 2017

    Hi,
    Please suggest a better WDL, but this is what I am thinking. I didn't know if I could leave out the command section of the WDL.

    task getarray {
        Array[File] gdsfilesin

        command {
            ls -lh ${sep=' ' gdsfilesin}
        }

        output {
            Array[File] gdsfilesout = gdsfilesin
        }

        runtime {
            docker: "[email protected]:84c334414e2bfdcae99509a6add166bbb4fa4041dc3fa6af08046a66fed3005f"
        }
    }
    
    workflow gdsfiles {
    
        call getarray
    }
    

    It's kind of ridiculous that this took 15+ minutes to run. Suggestions for tiny docker images and/or other time-savers would be appreciated.

  • KateN (Cambridge, MA; Member, Broadie, Moderator, admin)
    edited August 2017

    To answer your first question, you can just have a blank command section in your task; that may save you a bit of time.

    To your second question, the smallest Docker image I could find is openjdk:alpine. It just contains Java, and although you don't need Java for your task, the image should be very lightweight.

    Edit: Don't use that docker image, see my comment below for an updated one.

  • Ruchi (Member, Broadie, Moderator, Dev, admin)
    edited August 2017

    Hey @akmanning,

    If you're just trying to pipe an array of inputs back as an array of outputs, can I recommend you try re-writing your WDL as:

    workflow gdsfiles {
       Array[File] gdsfilesin
    
       output {
          Array[File] gds_files = gdsfilesin
       }
    }
    

    I tested this on a mock sample with a mock WDL, which took about a minute to run.

    workflow wf_output {
      Array[String] in_arr = ["a","b","c"]
    
      output {
        Array[String] out_arr = in_arr
      }
    }
    

    My method config:
    wf_output.out_arr: (Array[String])this.example_attr

    This is what the data model looked like after.

  • KateN (Cambridge, MA; Member, Broadie, Moderator, admin)
    edited August 2017

    @Ruchi Were you able to run a WDL just like the one you wrote in FireCloud? I was under the impression that both an empty command block (command { }) and a docker image (runtime { docker: "dockerimage" }) were required.

    If a docker image is required, then I have come back to add that you will need to use a docker image that has bash installed, instead of the one I mentioned above. You can use this one, which has Java and bash:
    https://hub.docker.com/r/joepjoosten/openjdk-alpine-bash/~/dockerfile/

    Or this one, which does not have Java but does have bash:
    https://hub.docker.com/r/frolvlad/alpine-bash/
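
    Putting the two suggestions together (a blank command plus a small bash-capable image), a trimmed-down version of the earlier task might look something like this untested sketch, keeping the original input and output names:

    task getarray {
        Array[File] gdsfilesin

        # Nothing actually needs to run; the blank command just keeps the task valid.
        command {}

        output {
            Array[File] gdsfilesout = gdsfilesin
        }

        runtime {
            # Small image that includes bash (see the link above).
            docker: "frolvlad/alpine-bash"
        }
    }

    workflow gdsfiles {
        call getarray
    }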

  • Ruchi (Member, Broadie, Moderator, Dev, admin)

    Hey Kate,

    I shared the exact method I ran (the same workflow as the one posted above, named wf_output):
    https://portal.firecloud.org/#methods/m/Test-Concepts/pipe_outputs/1

    Submission information:
    https://portal.firecloud.org/#workspaces/broad-firecloud-dsde/RM_Playground/monitor/07f7499a-2623-40ba-9824-b31590955f25/ad9e0ba5-8ab0-40b3-a7c0-ec1349c1f2cd

    This works because Cromwell doesn't need to run a task to create workflow-level outputs; if you want to pass your workflow inputs straight back out as your workflow outputs, it can do that without running any jobs at all.

  • KateN (Cambridge, MA; Member, Broadie, Moderator, admin)

    Oh, fantastic! I had no idea that trick existed. @akmanning, Ruchi's workflow should minimize overhead for you much more effectively than my earlier suggestions.
