How to process Array[Array[File]] outside a scatter?

myourshawmyourshaw University of ColoradoMember

What is the best practice for dealing with an Array[Array[File]] that was created by a scatter operation? A simplified, abbreviated example of my use case:

scatter (lane in lanes) {
    call IlluminaBasecallsToSam {
        input:
            lane = lane,
            ...
    }
    # Inside the scatter, IlluminaBasecallsToSam.ubams is type Array[File]
    # where the files in the Array may be multiple libraries in the given lane.
}

# Outside the scatter, IlluminaBasecallsToSam.ubams is type Array[Array[File]]
# where the outer Array is lanes and the inner Array is libraries.

task IlluminaBasecallsToSam {
    ...
    output {
        Array[File] ubams = glob("*.bam")
    }
}

Outside the scatter, I want to process all the files, for example by reorganizing the data so the outer array is libraries and the inner array is lanes, in order to run a single sample workflow, which takes as an input an Array[File] of library ubams from all relevant lanes.

"sep" doesn't seem to be helpful in this case. I can't figure out an elegant way to do this in wdl.

Best Answer

Answers

  • myourshawmyourshaw University of ColoradoMember

    My current approach is ugly:

    task SelectBamsByLibrary {
      String python3_cmd
      Array[Array[File]] bams
      String library
    
      command <<<
        ${python3_cmd} <<CODE
        import pysam
    
        # lines are tab-separated list of bams for a lane
        # expecting only one readgroup per bam, but check all just in case
        # and don't duplicate selected bams
        bam_set = set()
        with open("${write_tsv(bams)}", 'r') as all_bams:
            for lane_bams in all_bams:
                for bam in lane_bams.strip().split('\t'):
                    # check whether LB tag matches library
                    with pysam.AlignmentFile(bam, "rb", check_sq=False) as b:
                        for rg in b.header['RG']:
                            if rg['LB'] == "${library}":
                                bam_set.add(bam)
        for b in sorted(bam_set):
            print(b)
        CODE
      >>>
      runtime {
        memory: "1G"
        cpu: 1
      }
      output {
        Array[String] library_bams = read_lines(stdout())
      }
    }
    
  • myourshawmyourshaw University of ColoradoMember

    flatten() is just what I needed. It is not documented in the current stable WDL spec, but is supported by the Cromwell 31. Which raises the question: what is the status of read_json() and write_json()?

  • ChrisLChrisL Cambridge, MAMember, Broadie, Moderator, Dev

    @myourshaw interesting - thanks for pointing that out!

    I had a look at the spec history and it looks like the flatten spec change got lost somehow in the transition to supporting multiple versions of the spec at the same time.

    I've redone that change and you should now be able to find the documentation here: https://github.com/openwdl/wdl/blob/master/versions/draft-2/SPEC.md#arrayx-flattenarrayarrayx

  • ChrisLChrisL Cambridge, MAMember, Broadie, Moderator, Dev

    The read_json and write_json functions should work as you'd expect in Cromwell 31. The PR which added them was https://github.com/broadinstitute/cromwell/pull/3175

    Let us know (preferably in a new forum post!) if you have any issues with them. Thanks!

Sign In or Register to comment.