Dividing files for a scatter

Hello, is there a way to divide a group of Files for a scatter rather than scattering the individual Files?
I have an Array of Files as the output of one of my tasks and, to begin with, I'd like to divide half of them and send them to one scatter and the other half to another scattered task (in the future, we'd like to be able to divide the files in arbitrary ways).
I can't figure out a way to do this with WDL; how does everyone accomplish this?

Answers

  • Thib (Cambridge; Member, Broadie, Dev)
    edited April 4

    The first thing I can think of: instead of scattering over the files directly, you can scatter over an array of their indices (or a subset of them) and then index into the files array inside the scatter block.
    For example:

    workflow w {
        Array[File] files
        Int half_index = length(files) / 2
    
        # First half
        scatter(i in range(half_index)) {
            File f = files[i]
        }
    
        # Second half (picks up the extra file when the number of files is odd)
        scatter(i in range(length(files) - half_index)) {
            File f = files[i + half_index]
            ...
        }
    }
    
  • Yes, I think this is closer to what we're looking for!
    Is there a way to slice the array with an index range, so that we can send the first half of the array to one scatter call and the second half to another?

    In Python we would do something like: files[0:half_index]

  • Ahh rats. Ok thanks!

  • Just wanted to share the little hack we've implemented to split groups of files. In the scattered task, we bulk-copy all the files that we want to split (yes, copying every file to each scattered VM is inefficient, but we'll work on making that more efficient later) and then use the Unix split command to divide the listing of files evenly into N segments. Each scattered VM then moves the files listed in its own segment file (the one selected by its scatter index) to a different directory and passes that directory to the tool. Some code:

    task foo_task {
      File file
      Int scatter_index
    
      command <<<
        FILE=$(echo ${file} | sed -e "s/\/cromwell_root\///g")
        DIR=$(dirname $FILE)
    
        mkdir -p /cromwell_root/staging_dir
        time gsutil -q -m cp -R gs://$DIR /cromwell_root/staging_dir
    
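        ls /cromwell_root/staging_dir/$(basename $DIR) > gsc.list
        # Split the list of files into exactly 2 files, list-00 and list-01, using GNU split.
        # l/2 splits on line boundaries; a plain "-n 2" splits by bytes and can cut a file name in half.
        split -n l/2 -d gsc.list list-
    
        # The destination directory must exist; otherwise mv renames each file to "split_input_files".
        mkdir -p /cromwell_root/split_input_files
        cat list-0${scatter_index} | xargs -I filename mv /cromwell_root/staging_dir/$(basename $DIR)/filename /cromwell_root/split_input_files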
      >>>
      ..
    }
    

    The scatter_index is set back in the workflow:

      Array[Int] scatter_index = [0,1]
      scatter(i in scatter_index) {
        call foo_task { input: scatter_index=i, file=previous_task.output_files[0] }
      }
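
For anyone who wants more than two segments, here is a minimal sketch of the "pick the Nth segment of the listing" step on its own. It is not the code we run: it swaps split for awk so that the segments are computed from line counts, and the function name select_segment and the example file names are made up for illustration. It assumes every line of the listing ends with a newline, as the output of ls does.

```shell
# Print segment IDX (0-based) out of N contiguous segments of the listing LIST.
# Segments differ in size by at most one line and together cover every line once.
select_segment() {
  list=$1
  n=$2
  idx=$3
  # Count the lines in the listing (assumes a trailing newline on the last line).
  total=$(wc -l < "$list")
  awk -v total="$total" -v n="$n" -v idx="$idx" '
    BEGIN {
      # First and last line numbers (1-based, inclusive) of this segment.
      start = int(idx * total / n) + 1
      end   = int((idx + 1) * total / n)
    }
    NR >= start && NR <= end
  ' "$list"
}
```

In a task like foo_task, something along the lines of `select_segment gsc.list 2 ${scatter_index}` could stand in for the split and cat steps, with the output piped to the same xargs mv command.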
    