
Dividing files for a scatter

Hello, is there a way to divide a group of Files for a scatter, rather than scattering over the individual Files?
I have an Array of Files as the output of one of my tasks. To begin with, I'd like to send half of them to one scattered task and the other half to another (in the future, we'd like to be able to divide the files in arbitrary ways).
I can't figure out a way to do this with WDL; how does everyone accomplish this?

Best Answer


  • Thib (Cambridge; Member, Broadie, Dev)
    edited April 4

    The first thing I can think of: instead of scattering over the files directly, you can scatter over an array of their indices (or a subset of them) and then index into the array inside the scatter block.
    For example:

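    workflow w {
        Array[File] files
        Int half_index = length(files) / 2

        # First half: indices 0 .. half_index - 1
        scatter(i in range(half_index)) {
            File first_half_file = files[i]
        }

        # Second half: indices half_index .. length(files) - 1
        # (one element longer than the first half when the count is odd)
        scatter(j in range(length(files) - half_index)) {
            File second_half_file = files[j + half_index]
        }
    }
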
  • Yes, I think this is closer to what we're looking for!
    Is there a way to address the array with a range index so that we can send half the array in one scatter call and the other half in the second scatter call?

    In Python we would do something like: files[0:half_index]

  • Ahh rats. Ok thanks!

  • Just wanted to share the little hack we've implemented to split groups of files. In the task that is scattered, we perform a bulk copy of all the files that we want to split (yes, it's inefficient to copy all files to each scattered VM, but we'll work on making that more efficient later) and then use the Unix split command to evenly divide the listing of files into N segments. We then move all the files listed in the Nth segment file to a different directory and pass this directory to the tool on the scattered VM. Some code:

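    task foo_task {
      File file
      Int scatter_index
      command <<<
        # Strip the /cromwell_root/ prefix from the localized path, leaving bucket/path
        FILE=$(echo ${file} | sed -e "s/\/cromwell_root\///g")
        DIR=$(dirname $FILE)
        mkdir -p /cromwell_root/staging_dir /cromwell_root/split_input_files
        time gsutil -q -m cp -R gs://$DIR /cromwell_root/staging_dir
        ls /cromwell_root/staging_dir/$(basename $DIR) > gsc.list
        # Split the entire list of files into exactly 2 using the Unix split command, into files list-00 and list-01.
        # "l/2" keeps each line whole; a plain "-n 2" splits by byte count and can cut a file name in half.
        split -n l/2 -d gsc.list list-
        cat list-0${scatter_index} | xargs -I filename mv /cromwell_root/staging_dir/$(basename $DIR)/filename /cromwell_root/split_input_files
        # (the tool invocation on /cromwell_root/split_input_files is not shown)
      >>>
    }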

    The scatter_index is set back in the workflow:

      Array[Int] scatter_index = [0,1]
      scatter(i in scatter_index) {
        call foo_task { input: scatter_index=i, file=previous_task.output_files[0] }
      }
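
The list-splitting step in the hack above can be tried on its own before wiring it into a task. The following is a minimal sketch, assuming GNU coreutils `split` (the `l/2` chunk form is a GNU feature; a plain `-n 2` splits by byte count and may break a line in the middle). The file names and the temporary working directory are hypothetical stand-ins for the staged inputs.

```shell
#!/usr/bin/env bash
set -e

# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical file names standing in for the listing of staged inputs.
printf '%s\n' sample_a.bam sample_b.bam sample_c.bam sample_d.bam sample_e.bam > gsc.list

# l/2: two chunks, never breaking a line; -d: numeric suffixes (list-00, list-01).
split -n l/2 -d gsc.list list-

# Show which names ended up in which chunk.
for chunk in list-00 list-01; do
  echo "== $chunk =="
  cat "$chunk"
done

# Concatenating the chunks in order should reproduce the original listing.
cat list-00 list-01 | cmp -s - gsc.list && echo "all names accounted for"
```

Each scattered shard would then read only its own chunk (`list-00` or `list-01`), as `foo_task` does through `scatter_index`.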