Update: July 26, 2019
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on GitHub.

How am I supposed to specify directories with WDL?

nessus42 Member

I have a big directory (1 TB) of input files that I'd like to process using Cromwell. If I specify the directory as a File, then it copies the entire huge directory because it can't hard-link to the directory. If I specify the directory as a String, then a WDL task can't find the directory, because the task runs in a subdir and the path that I specified for the directory no longer makes sense in that subdir.

If I glob all the files in the directory, and pass the resulting Array[File] around instead, then things break because the shell command lines become too long.

Okay, I could specify a full path to the directory as a string, but I don't want to do that: I run the same WDL job on different computers, the paths to my working directory differ between them, and I just rsync the working directory around. It's much better if I can specify everything with relative paths.

I suppose I could make a shell alias that sets an environment variable to $PWD and then I could have a WDL task that constructs a full path to the directory using the environment variable and the relative path to the directory. This seems like a huge kludge, however.
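A minimal sketch of that kludge, with hypothetical names (WORKDIR and fastq-input are stand-ins, not anything from an actual workflow):

```shell
# Sketch of the workaround described above (names are hypothetical):
# capture the launch directory once, then resolve relative paths against it.
export WORKDIR="$PWD"        # set before launching Cromwell, e.g. in a Makefile
rel="fastq-input"            # a relative directory name passed into the WDL
abs="$WORKDIR/$rel"          # absolute path that stays valid in any task subdir
echo "$abs"
```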

Is there a way that this sort of thing is supposed to be done?

Best Answer

  • nessus42
    Accepted Answer

    I figured out a somewhat kludgey workaround to achieve what I want:

    task expandPreMadeDataDir {
        File preMadeDataDir
    
        command {
           dir=`pwd`; \
           cd "${preMadeDataDir}"; \
           ln *.gz "$dir"; \
           cd "$dir"; \
           ls > file-list.txt
    
        }
    
        runtime {
           docker: "nessus/mite-seq-process"
        }
    
        output {
            Array[File] files = read_lines("file-list.txt")
        }
    }
    

    This works both running natively and running using a Docker image.

    There is one remaining issue, however: If the directory has many, many files in it, then the glob in ln *.gz "$dir" could overflow the shell command line, causing the task to fail.
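One way to sidestep that limit, sketched here with made-up directory names, is to have find issue the ln calls itself, so no single command line ever has to hold every filename:

```shell
# Hypothetical sketch: link the .gz files without a shell glob, so a huge
# directory can never overflow ARG_MAX. "srcdir" stands in for the real
# input directory.
mkdir -p srcdir && touch srcdir/a.gz srcdir/b.gz   # stand-in input data
dest="$PWD"
# find hands each match to ln individually; nothing expands onto one command line
find srcdir -maxdepth 1 -name '*.gz' -exec ln {} "$dest" \;
# build the file list the same streaming way, again without a glob
find . -maxdepth 1 -name '*.gz' -exec basename {} \; > file-list.txt
```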

Answers

  • Ruchi Member, Broadie, Dev admin

    @nessus42 Directories aren't truly a supported type in WDL, but it's possible we can find a workaround for the problem you're encountering. It would be nice to get some more information:
    1. What version of Cromwell are you using?
    2. Which backend are you utilizing?
    3. Would you be able to share the WDL and inputs for this particular case?

    Thanks!

  • nessus42 Member
    edited July 2017

    Hi Ruchi, sorry for the delay in responding.

    I'm using Cromwell 28.

    For now I want to make my code work with the local backend, both with and without Docker.

    I've been able to get things to work with the non-Docker local backend with a task that looks like this:

    task expandPreMadeDataDir {
        String preMadeDataDir
    
        command {
           dir="${preMadeDataDir}"; \
           if [ $(expr "$dir" : '\(.\).*') == '/' ]; \
              then ls -d "${preMadeDataDir}"/*.gz; \
              else ls -d "$WORKDIR"/"${preMadeDataDir}"/*.gz; \
           fi
        }
    
        output {
            Array[File] files = read_lines(stdout())
        }
    }
    

    The fancy shell conditional is to do something different depending on whether I get a relative or absolute path passed in. (I define $WORKDIR in a Makefile that I use to start up the job.)
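A slightly simpler way to write that absolute-vs-relative test, shown as a standalone sketch with a hypothetical $WORKDIR, is a case pattern on the leading slash:

```shell
# Hypothetical sketch: anchor the path only when the input is relative.
# WORKDIR stands in for the launch directory exported by the Makefile.
WORKDIR="/tmp/launchdir"
dir="fastq-input"                 # could also arrive as /abs/path/fastq-input
case "$dir" in
  /*) abs="$dir" ;;               # already absolute: use it as-is
  *)  abs="$WORKDIR/$dir" ;;      # relative: resolve against the launch dir
esac
echo "$abs"
```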

    The above works fine if I don't use Docker, but if I add Docker to the mix, then I get the following error:

    ls: cannot access /rnai/archive/rnai/screening/mite-seq/test-data/fastq-input/*.gz: No such file or directory

Surely I can't be the only Cromwell user who wants to operate on a directory full of data files? Or do people typically pass in tarballs when they need to do this kind of thing?

    |>oug ([email protected])
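For what it's worth, the tarball route mentioned above can be sketched like this (file and directory names are hypothetical); the directory becomes a single File input that Cromwell can localize cheaply:

```shell
# Hypothetical sketch of the tarball approach: package the directory once
# on the submitting machine...
mkdir -p mydir && echo hello > mydir/x.txt   # stand-in data directory
tar -czf data.tar.gz mydir                   # pass data.tar.gz as a File input
rm -r mydir
# ...then unpack it inside the task's working directory:
tar -xzf data.tar.gz
cat mydir/x.txt
```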


  • ChrisL (Cambridge, MA) Member, Broadie, Dev admin

    Your solution seems to be replicating the built-in glob functionality?

    task foo {
      command {
        ./make_some_files
      }
      output {
        Array[File] files = glob("*.gz")
      }
    }
    
  • nessus42 Member

    @ChrisL I have to link the files into the working directory or it doesn't work when running with Docker. (Such contortions are not needed when running natively.)

    Yes, I could link the files into the current directory and then glob in the output section, but then I'd be globbing twice.

    My solution has the advantage, which I didn't mention in my comment but should have, that in a future version I can replace the command-line glob with a call to a Python script that does the globbing and linking, and that would be immune to overflowing the shell command line.

    The glob() command is not, unfortunately, immune to overflowing the shell command line, because if I look inside the scripts that Cromwell produces, I can see that the glob() call gets turned into a shell glob inside a shell script.

    I worry about this, because I have had data directories that did in fact overflow the shell command line.
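One overflow-proof pattern, sketched here with hypothetical names, is to stream paths from find instead of expanding a glob onto a single command line; the eventual Python script could take the same approach:

```shell
# Hypothetical sketch: build the file list by streaming paths, never by glob
# expansion, so ARG_MAX is irrelevant no matter how many files exist.
mkdir -p big && touch big/a.gz big/b.gz      # stand-in data directory
find big -name '*.gz' | while IFS= read -r f; do
  printf '%s\n' "$f"                          # one path per line, streamed
done > file-list.txt
```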

  • Redmar_van_den_Berg Member ✭✭
    edited July 2017

    I have created a wrapper that allows you to use arbitrarily nested folders with WDL. The resulting output can be passed to local tasks or to tasks that run with Docker. The only caveat is that read_folder itself cannot be executed using Docker. Feel free to post if you have an improved version where read_folder runs properly under Docker.

    task read_folder{
        String folder
    
    command {
        pwd=`pwd`
        # Stream paths from find so file names containing whitespace and very
        # large directories are handled safely (no word splitting, no ARG_MAX).
        find "${folder}" | while IFS= read -r path; do
          if [ -d "$path" ]; then
             mkdir -p "$pwd/$path"
          elif [ -f "$path" ]; then
             newfile="$pwd/$path"
             echo "$newfile"
             ln -s "$path" "$newfile"
          fi
        done
        }
    
        ################# IMPORTANT ####################
        # read_folder doesn't work when ran with docker#
        #runtime {                                     #
        #    docker: "${image}"                        #
        #}                                             #
        ################# IMPORTANT ####################
    
        output {
        Array[File] files = read_lines(stdout())
        }
    }
    
    task filesize {
        Array[File] files
    
        command {
            for file in ${sep=" " files}; do
                du -sh "$file"
            done
        }
    
        runtime {
            docker: "ubuntu:16.04"
        }
    
        output {
            File size = stdout()
        }
    }
    
    workflow folder {
        String folder_path
    
        call read_folder {
            input: folder = folder_path
        }
    
        call filesize {
            input: files = read_folder.files
        }
    }
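For completeness, a hypothetical inputs file for the folder workflow above might look like this (the path is made up):

```json
{
  "folder.folder_path": "relative/path/to/data"
}
```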
    