Is there a simple way to regenerate the directory structure of scattered files?

dheiman ✭✭ Member, Broadie
edited April 2018 in Ask the FireCloud Team

It's a common paradigm for tools to expect files to be organized in a certain way. Is there a simple way to preserve the initial organization of files? For example: the files are loaded into a directory in the workspace bucket, scattered for some pre-processing, and then gathered back into their original layout for processing together.

I have some example code for a method I've come up with to do this. In this case I'm simply unzipping a bunch of individual files, returning them to their original directory structure, then zipping them together en masse so that whatever comes next receives them pre-organized.

task gunzip {
    File archive
    String file = sub(archive, "\\.gz$", "")

    command {
        set -euo pipefail
        # According to example 2 of
        # https://github.com/openwdl/wdl/blob/develop/SPEC.md#string-substring-string-string
        # the mkdir shouldn't be necessary, but this fails without it.
        mkdir -p $(dirname ${file})

        zcat -f ${archive} > ${file}
    }

    output {
        File files = file
    }

    runtime {
        docker : "broadgdac/firecloud-ubuntu:16.04"
    }

    meta {
        author : "David Heiman"
        email : "[email protected]"
    }
}

task zip {
    Array[File] files

    command <<<
        set -euo pipefail

        strtdir=$(dirname ${select_first(files)})

        # The original base directory is the first directory after
        # .*/shard-[0-9]+/(execution/)?
        basedir=$(basename $(echo $strtdir | sed 's|^.*/shard-[0-9]\{1,\}/\(execution/\)\{0,1\}\([^/]\{1,\}\)/.*$|\2|'))
        mkdir $basedir

        # Start the search for base directories one directory above shard-[0-9]+/,
        # then copy (as symlinks) the contents recursively into the base directory
        # in the working directory
        rootdir=$(echo $strtdir | sed 's|^\(.*\)/shard-[0-9].*$|\1|')
        find $rootdir -name $basedir -type d -exec bash -c 'cp -r -s "$1"/* "$2"' Cp {} $basedir \;

        # Create an archive preserving the recreated file paths
        zip -r files_archive.zip $basedir
    >>>

    output {
        File files_archive="files_archive.zip"
    }

    runtime {
        docker : "broadgdac/firecloud-ubuntu:16.04"
    }

    meta {
        author : "David Heiman"
        email : "[email protected]"
    }
}

workflow merge_archives {
    Array[File] archives

    scatter (archive in archives) {
        call gunzip {input: archive=archive}
    }

    call zip {input: files=gunzip.files}

    output {zip.files_archive}
}
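To make the path surgery in the zip task concrete, here is a sketch of what the two sed expressions extract from a localized shard path. The path below is hypothetical (the exact prefix varies by Cromwell backend); only the shard-N/execution/ segment matters:

```shell
# Hypothetical localized input path for one shard (prefix varies by backend)
strtdir=/cromwell-executions/merge_archives/wf-id/call-gunzip/shard-3/execution/mydata/subdir

# First sed: the original base directory is the first path component
# after shard-N/, optionally skipping an execution/ segment
basedir=$(echo $strtdir | sed 's|^.*/shard-[0-9]\{1,\}/\(execution/\)\{0,1\}\([^/]\{1,\}\)/.*$|\2|')
echo $basedir    # mydata

# Second sed: everything up to shard-N, i.e. the directory to search
# from when gathering the shards' outputs back together
rootdir=$(echo $strtdir | sed 's|^\(.*\)/shard-[0-9].*$|\1|')
echo $rootdir    # /cromwell-executions/merge_archives/wf-id/call-gunzip
```

Every shard recreates the same base directory name under its own shard-N path, which is why the find over $rootdir can merge them all back into one tree.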

I feel like there must/should be a simpler way to do this.

Thanks!

Answers

  • abaumann ✭✭✭ Broad DSDE, Member, Broadie

    Is this what you want? https://github.com/openwdl/wdl/pull/173

    That's in progress in openwdl, but not hugely active at the moment, and after that it would need to be added to Cromwell. For now, zip or tar or something similar is really your best bet, unfortunately.

  • dheiman ✭✭ Member, Broadie

    Not quite. The issue is that sharding via scatter-gather modifies the directory structure of the individual files, so the only way to regenerate the original path in the gather step is to copy/symlink everything beneath the shard-[0-9]+ directory into the working directory (or a subdirectory thereof).

    It also turns out that the above code does not work when running Cromwell with the PAPI backend (i.e. FireCloud). Since in FireCloud File archive is not localized until the command block runs, String file = sub(archive, "\\.gz$", "") resolves to a Google bucket path. String file = sub(sub(archive, "\\.gz$", ""), "^gs://", "") might work, but again, the complexity of doing something as straightforward as "do this thing to each file in this directory at the same time, then put the results back in the same place" feels really contrived for what is a fairly common pattern with scientific software.
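    For illustration, the double sub() proposed above behaves like the following shell equivalent (the bucket path here is made up):

    ```shell
    # Hypothetical gs:// path as seen before localization
    archive=gs://my-bucket/data/sample1.txt.gz

    # Equivalent of sub(sub(archive, "\\.gz$", ""), "^gs://", ""):
    # strip the .gz suffix, then the gs:// scheme, leaving a relative
    # path that mkdir -p can recreate inside the working directory
    file=$(echo $archive | sed -e 's|\.gz$||' -e 's|^gs://||')
    echo $file    # my-bucket/data/sample1.txt
    ```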

  • abaumann ✭✭✭ Broad DSDE, Member, Broadie

    @Ruchi might know of another way around this, or otherwise some feature we could implement to make this easier

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hey @dheiman,

    I might be misunderstanding here: are you trying to take some previously created workflow outputs and archive them into a zipped file in a flattened structure?

    Thanks!
