
What is the best way to specify a file output when the file name is not known ahead of the run?

birger Member, Broadie, CGA-mod

I am writing a task (and workflow) that takes a UUID as input and retrieves from a repository the file associated with that UUID. While I could determine the name of the file ahead of time, I'd rather not. Instead, I want the task to retrieve the file, and then I want to specify that file as an output. I'm experimenting with using the glob function to get an array of files and then indexing into the resulting File array (I don't know yet whether this will work). Regardless, is there a better way to do this?
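
As a concrete illustration, here is a minimal sketch of what I'm trying; the retrieval command and the assumption that it writes into a per-UUID subdirectory are both placeholders:

    task fetch_by_uuid {
      String file_uuid

      command {
        # hypothetical retrieval tool; stands in for whatever client the repository provides
        retrieve_tool download ${file_uuid}
      }
      output {
        # glob returns an Array[File]; indexing into it picks out the single
        # retrieved file, whatever its name turns out to be
        File retrieved = glob("${file_uuid}/*")[0]
      }
    }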



Answers

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie
    Hi @birger, I don't think we have anything preset to do what you're describing, but perhaps I'm misunderstanding. Could you please provide a hypothetical case worked out to illustrate what you want to do?
  • birger Member, Broadie, CGA-mod

    I'll describe the use case in greater detail (FYI, this is not hypothetical; it is an actual use case we are supporting for the NCI Cloud Pilot project).

    1. Analyst uses the GDC Data Portal (https://gdc-portal.nci.nih.gov/) to create a manifest of desired files
      a. Specify case/biospecimen filters
      b. Specify file filters
      c. Download manifest (for use with the GDC data transfer tool)
    2. Run Client Tool to create FireCloud data model entity load files (“TSV” files) from the GDC manifest
    3. Go to FireCloud, create a new workspace, and populate workspace data model using load files generated in step 2
    4. Run FireCloud workflow to retrieve files from GDC and copy them to Google Cloud Storage

    I have steps 1-3 implemented. I am currently working on step 4, and that is what this forum posting is about. The single-task WDL workflow takes as input file UUIDs obtained from the GDC. The task uses the GDC's data transfer tool to download the files identified by the UUIDs to the task's VM (the Docker container running the file-retrieval task). Those files have names, for example nationwidechildrens.org_clinical.TCGA-AF-3913.xml. I want to preserve those filenames and specify the files as task- and workflow-level outputs (so each file is both delocalized to cloud storage AND written back to the FireCloud data model). The crux of the issue is that the file UUID is the input to the task, and I want the named file to be the output of the task. This is different from the scenarios presented in the documentation and online WDL examples; in those examples, the output file names can be constructed from task input strings.
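
    Roughly, the single-task workflow I have in mind looks like the sketch below. The task, workflow, and image names are hypothetical, and I'm assuming gdc-client takes a token file via -t and writes the downloaded file into a directory named after the UUID:

        task gdc_download {
          String file_uuid
          File gdc_token

          command {
            gdc-client download -t ${gdc_token} ${file_uuid}
          }
          output {
            # the original filename (e.g. nationwidechildrens.org_clinical.TCGA-AF-3913.xml)
            # is preserved, because glob picks up whatever gdc-client wrote
            File retrieved_file = glob("${file_uuid}/*")[0]
          }
          runtime {
            docker: "my-org/gdc-client"  # hypothetical image
          }
        }

        workflow gdc_retrieve {
          String file_uuid
          File gdc_token

          call gdc_download { input: file_uuid=file_uuid, gdc_token=gdc_token }

          output {
            # surfaced at workflow level so FireCloud can both delocalize the file
            # and write it back to the data model
            gdc_download.retrieved_file
          }
        }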

    I have a workaround for now: I include the filename, alongside the file UUID, in the data model. This should not be necessary, however, and it just clutters our data model.
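
    For completeness, the workaround version looks roughly like this (assuming a file_name attribute is stored in the data model alongside the UUID; names are again hypothetical):

        task gdc_download_named {
          String file_uuid
          String file_name

          command {
            gdc-client download ${file_uuid}
            # assumes gdc-client writes into a directory named after the UUID
            mv ${file_uuid}/${file_name} .
          }
          output {
            # because the name is passed in as a String, the output can be
            # declared directly, without glob
            File retrieved_file = "${file_name}"
          }
        }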

  • birger Member, Broadie, CGA-mod

    @Geraldine_VdAuwera, I discussed this with Ruchi during FireHose/FireCloud/WDL office hours. It sounds like WDL currently doesn't support my use case. I was told to wait until Cromwell 23, which fixes globs. But it was still unclear to me whether the WDL language itself supports this. Maybe @kcibul can comment.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie
    @KateN and @KateVoss Can you please follow up on this one?
  • birger Member, Broadie, CGA-mod

    Not really. I have a workaround, but I have no idea whether I will be able to do what I want in the future (once Cromwell 23 is in place).
