How to delocalize a file whose name is in a file?

dpark, Member
edited November 2017 in Ask the Cromwell + WDL Team

Hi,

Is there a way to specify an output file whose name is in a text file? My understanding of WDL and type coercion led me to believe that I could just set a File output to read_string("text_file_with_filename_in_it.txt") (or read_lines for Array[File]). But that doesn't seem to work--it comes back as a string instead of a delocalized File (and the file itself doesn't get delocalized).

A minimal working example WDL is below. I've commented out a number of output lines that cause errors. As written, the workflow completes successfully but does not produce the expected result: in the output at the bottom, the variable test_flow.test_extract.flowcell_toplevel_txt is set to a list of plain strings rather than GCS URIs, and none of those files appear in my bucket after it completes.

task test_extract {
  File tarball

  command {
    set -ex -o pipefail

    tar -xzf ${tarball}

    # Yes, these ls commands are simple and could be replaced by a glob, but
    # pretend that in the real life scenario, these are some more complicated,
    # opaque commands that generate a set of output files, with their
    # unpredictable arbitrary names saved in text files.
    ls -1 *.txt > fnames_txt.txt
    ls -1 *.xml > fnames_xml.txt
    ls -1 *.csv > fnames_csv.txt
  }

  output {
    File samplesheet = "SampleSheet.csv" # this works
    File runparams = select_first(glob("r*.xml")) # this works but is ugly (puts the file in a glob-xx/ subdir). Also I'm not certain that this would only select the first or if all the glob would delocalize anyway.
    #File runinfo = select_first(glob(read_lines("fnames_xml.txt"))) # this does not work (error: WorkflowManagerActor Workflow 900ea079-5628-418d-961e-ee48140312d5 failed (during ExecutingWorkflowState): Could not evaluate test_extract.runinfo = select_first(glob(read_lines("fnames_xml.txt"))))

    String runInfo_fname = select_first(read_lines("fnames_xml.txt")) # this is setup for next line
    File runinfo2 = "${runInfo_fname}" # this does not work (nop -- this becomes a String instead of coercing to a File)

    Array[File] flowcell_toplevel_txt = read_lines("fnames_txt.txt") # this does not work (nop: exits success but becomes Array[String] instead of Array[File]) -- this is the simplest form of what I would expect to work
    Array[File] flowcell_toplevel_csv = read_lines("fnames_csv.txt") # this works as expected! but only because we manually delocalized it above?
    #Array[File] flowcell_toplevel_xml = glob(read_lines("fnames_xml.txt")) # this does not work (runtime error similar to above (could not evaluate))
  }

  runtime {
    docker: "phusion/baseimage:0.9.22"
    memory: "2GB"
    cpu: 1
  }
}

workflow test_flow {
  call test_extract { input: tarball="gs://sabeti-public/dpark-test/AJH8U.tar.gz" }
}

Actual Cromwell stdout output (Cromwell v29 on Google JES backend):

 [info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
  "outputs": {
    "test_flow.test_extract.runparams": "gs://sabeti-temp-30d/dpark/cromwell-test/test_flow/df7a673f-0831-4fa6-98ff-cf76a34137a0/call-test_extract/glob-df22ec457c0d41d48cc9ba47a166d611/runParameters.xml",
    "test_flow.test_extract.flowcell_toplevel_txt": ["Basecalling_Netcopy_complete_Read1.txt", "Basecalling_Netcopy_complete_Read2.txt", "Basecalling_Netcopy_complete_Read3.txt", "Basecalling_Netcopy_complete_Read4.txt", "Basecalling_Netcopy_complete.txt", "ImageAnalysis_Netcopy_complete_Read1.txt", "ImageAnalysis_Netcopy_complete_Read2.txt", "ImageAnalysis_Netcopy_complete_Read3.txt", "ImageAnalysis_Netcopy_complete_Read4.txt", "ImageAnalysis_Netcopy_complete.txt", "RTAComplete.txt"],
    "test_flow.test_extract.runInfo_fname": "RunInfo.xml",
    "test_flow.test_extract.samplesheet": "gs://sabeti-temp-30d/dpark/cromwell-test/test_flow/df7a673f-0831-4fa6-98ff-cf76a34137a0/call-test_extract/SampleSheet.csv",
    "test_flow.test_extract.runinfo2": "RunInfo.xml",
    "test_flow.test_extract.flowcell_toplevel_csv": ["gs://sabeti-temp-30d/dpark/cromwell-test/test_flow/df7a673f-0831-4fa6-98ff-cf76a34137a0/call-test_extract/SampleSheet.csv"]
  },
  "id": "df7a673f-0831-4fa6-98ff-cf76a34137a0"
}

Answers

  • dpark, Member

    Thanks, that makes total sense to me--the fact that glob even works at all is where the confusion comes in, because it seems to be the exception to that general rule that the execution backend needs to know the output names in advance.

  • ChrisL (Cambridge, MA), Member, Broadie, Moderator, Dev

    @dpark you're right, it is a very convenient exception, but it can make handling the result a bit more awkward.

    In the case of globs, we do need to know the glob string before the command begins. So when we send our "execution request" to a backend it looks something like (don't quote me on this exact structure...):

    {
      "command_string": "...",
      "docker_image": "...",
      "files_to_localize_at_start": { "url://global/path/to/x.input": "x.input", ... },
      "files_to_delocalize_at_end": {
        "a.txt": "url://global/path/to/a.txt",
        "b.*": "url://global/path/to/b/"
      }
    }
    

    So we need to know "a.txt" before the job runs, and we need to know "b.*" to build the request, but we can find out that the glob matched b.txt by seeing what got delocalized to "url://global/path/to/b/".
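
    Given that constraint, one possible workaround is to have the command itself move the dynamically named files to paths that are fixed before the job runs, and then declare those fixed paths (or a glob over them) as outputs. The task below is only a sketch under that assumption and has not been run against Cromwell v29; the task name test_extract_by_dir, the directory out_txt and the file name selected_runinfo.xml are made up for illustration.

    ```wdl
    task test_extract_by_dir {
      File tarball

      command {
        set -ex -o pipefail

        tar -xzf ${tarball}

        # Stand-ins for the opaque commands that write the file name lists.
        ls -1 *.txt > fnames_txt.txt
        ls -1 *.xml > fnames_xml.txt

        # Many files: move each listed file into a directory whose name is
        # known in advance, so a fixed glob string can pick them up.
        # (out_txt is a hypothetical name.)
        mkdir -p out_txt
        while read -r f; do
          mv "$f" out_txt/
        done < fnames_txt.txt

        # Single file: copy the first listed file to a fixed name.
        # (selected_runinfo.xml is a hypothetical name; the original
        # file name is not preserved in the output path.)
        cp "$(head -n 1 fnames_xml.txt)" selected_runinfo.xml
      }

      output {
        # Both expressions are literal strings that can be evaluated
        # before the command runs.
        Array[File] flowcell_toplevel_txt = glob("out_txt/*")
        File runinfo = "selected_runinfo.xml"
      }

      runtime {
        docker: "phusion/baseimage:0.9.22"
        memory: "2GB"
        cpu: 1
      }
    }
    ```

    As with runparams in the original task, the globbed files would be expected to land under a glob-xx/ subdirectory rather than directly in the call directory.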
