Two level scatter gather

Suppose I want to scatter over the output of a call that itself runs inside a scatter-gather. Pseudocode below.

workflow test {
  Array[File] things = read_tsv(inputfile)
  scatter (thing in things) {
    call task1 { input: stuff = thing }
    scatter (object in task1.array) {
      call task2 { input: stuff = object }
    }
  }
}
For example, I want to run BWA on 10 samples, but I want to break up each sample's fastq files into 10 pieces and run BWA on the pieces in parallel to make it faster.
Right now WDL complains that it "could not get a value for task1". What could I do to make it work?

Answers

  • ChrisL (Cambridge, MA) · Member, Broadie, Moderator, Dev

    @KateN @awacs I just wanted to post an update to this, now that things have settled out a bit. Scatters within scatters are unlikely to happen soon, because of the way Cromwell indexes the shards of a scatter (e.g. we can reference scatter shard 10, but we can't support shard 10:10 for the inner scatter).

    The good news is that, depending on your scenario, there will now be a way to express what you want.

    • If you wanted to use scatters as a way of iterating over two arrays (e.g. for x in xs: ... for y in ys: ...), you can use the cross(xs, ys) function to get an Array[Pair[X,Y]] and scatter over that single array instead.

    • If that fails, you can call subworkflows from inside a scatter, and subworkflows can contain their own scatters. So you can do something like:

    • main.wdl

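    import "sub.wdl" as sub
    workflow test {
      File inputfile
      # read_lines gives one entry per line; read_tsv would return Array[Array[String]]
      Array[File] things = read_lines(inputfile)
      scatter (thing in things) {
        call sub.sub_wf { input: thing = thing }
      }
    }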
    
    • sub.wdl
    task task1 { ... }
    task task2 { ... }
    workflow sub_wf {
      File thing
      call task1 { input: stuff = thing }
      # This scatter is top-level within sub_wf, so it is not nested inside another scatter
      scatter (object in task1.array) {
        call task2 { input: stuff = object }
      }
    }
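
For the first option, here is a minimal sketch of replacing two nested loops with one flat scatter over cross(). The task align_pair, its inputs, and the sample and chunk names are hypothetical and exist only for illustration; the syntax is draft-2 style to match the snippets above.

```wdl
# Hypothetical task: the name, inputs, and command are assumptions for illustration.
task align_pair {
  String sample
  String chunk
  command {
    echo "${sample} ${chunk}"
  }
  output {
    String result = read_string(stdout())
  }
}

workflow cross_example {
  Array[String] samples = ["s1", "s2"]
  Array[String] chunks = ["c1", "c2", "c3"]

  # cross() yields every (sample, chunk) combination: 2 x 3 = 6 pairs.
  Array[Pair[String, String]] pairs = cross(samples, chunks)

  # A single flat scatter stands in for the two nested loops.
  scatter (p in pairs) {
    call align_pair { input: sample = p.left, chunk = p.right }
  }
}
```

Note that this only helps when both arrays are known before the scatter starts. When the inner array is produced by a call inside the outer scatter, as in the original question, the subworkflow approach is the one that applies.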
    

    Issue · Github, by Geraldine_VdAuwera. Issue Number: 1995. State: open.