Cromwell input file grouping inconsistencies

In a discussion from 2017, @KateN implied that File inputs to a task would all be grouped in the same directory.

When I run the GATK Best Practices HaplotypeCaller workflow as a standalone workflow, this does seem to be true:

$ find
.
./execution
./execution/script
./execution/script.submit
./execution/stdout.submit
./execution/stderr.submit
./execution/stdout
./execution/stderr
./execution/K000032_1_lane_dupsFlagged_sm_tagged.g.vcf.gz
./execution/K000032_1_lane_dupsFlagged_sm_tagged.g.vcf.gz.tbi
./execution/rc
./inputs
./inputs/691592025
./inputs/691592025/GRCh37-lite.fa.fai
./inputs/691592025/GRCh37-lite.fa
./inputs/691592025/GRCh37-lite.dict
./inputs/691592025/K000032_1_lane_dupsFlagged_sm_tagged.bam.bai
./inputs/691592025/K000032_1_lane_dupsFlagged_sm_tagged.bam
./inputs/648028064
./inputs/648028064/scattered.interval_list
./tmp.72c8af67

However, when I run it as a subworkflow, it splits up the inputs into separate subdirectories:

$ find
.
./execution
./execution/script
./execution/script.submit
./execution/stdout.submit
./execution/stderr.submit
./execution/stdout
./execution/stderr
./execution/NA12878_10X_downsampled.g.vcf.gz
./inputs
./inputs/-1113293895
./inputs/-1113293895/scattered.interval_list
./inputs/-1371904032
./inputs/-1371904032/GRCh37-lite.dict
./inputs/-1371904032/GRCh37-lite.fa.fai
./inputs/-1371904032/GRCh37-lite.fa
./inputs/-1931584957
./inputs/-1931584957/NA12878_10X_downsampled.bam
./inputs/-1931584957/NA12878_10X_downsampled.bam.bai
./tmp.8c08232a

And when I run it with inputs that are the result of a scatter operation, it splits them up even further, putting the bam and its index in separate directories, and HaplotypeCaller fails:

$ find
.
./execution
./execution/script
./execution/script.submit
./execution/stdout.submit
./execution/stderr.submit
./execution/stdout
./execution/stderr
./execution/NA12878_5X_downsampled.g.vcf.gz
./execution/NA12878_5X_downsampled.g.vcf.gz.tbi
./execution/rc
./inputs
./inputs/-1371904032
./inputs/-1371904032/GRCh37-lite.dict
./inputs/-1371904032/GRCh37-lite.fa
./inputs/-1371904032/GRCh37-lite.fa.fai
./inputs/-216796705
./inputs/-216796705/NA12878_5X_downsampled.bam.bai
./inputs/-1113293895
./inputs/-1113293895/scattered.interval_list
./inputs/-1204402517
./inputs/-1204402517/NA12878_5X_downsampled.bam

How does Cromwell decide where to put inputs to a task? How can I coerce it into putting them in the same directory?

I am using Cromwell 36 with the Slurm backend.

Answers

  • oneillkza Member

    Note: for now I am hacking around this using the following:

    command <<<
      # move the index into the same directory as the bam so the tool can find it
      path=$(dirname ${input_bam})
      mv ${input_bam_index} $path
      java -jar ....
    >>>
  • oneillkza Member

    Hi @ChrisL -- thanks, that would explain it. In order to scatter over a directory of bams, I have to run two separate glob tasks, one to find all the bam files and another to find all the bais. Since these glob tasks get executed separately, the files end up in separate places going into the HaplotypeCaller task.
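
    Roughly, the two tasks look something like this (sketched with ls + read_lines; the directory, task, and variable names are simplified placeholders):

    task find_bams {
      String bam_dir
      command {
        ls ${bam_dir}/*.bam
      }
      output {
        Array[File] bams = read_lines(stdout())
      }
    }

    task find_bais {
      String bam_dir
      command {
        ls ${bam_dir}/*.bam.bai
      }
      output {
        Array[File] bais = read_lines(stdout())
      }
    }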

    I guess this is something which the Directory type, when it's implemented, will largely solve.

    Is there a good recipe somewhere for how to use Cromwell for the (fairly common) problem of scattering a task over a directory of input files with accompanying index files?

    I've kind of had to piece together a way of doing this from a lot of discussion threads on this forum and GitHub, and the way I'm doing it feels a little fragile.

  • ChrisL (Cambridge, MA) Member, Broadie, Moderator, Dev admin

    I think you're probably most of the way there.

    Is there any reason you have to glob in separate tasks? E.g. I'd assume something like this would work:

    output {
      Array[File] bams = glob("*.bam")
      Array[File] bais = glob("*.bam.bai")
      Array[Pair[File, File]] bams_and_indexes = zip(bams, bais)
    }
    

    Then say you have a task that takes in two files, and the tool needs them to be together before running. Moving them together as part of the command (i.e. ensuring that the command sets up the execution directory the way the tool expects it) is 100% a valid and appropriate thing to do in the workflow! Indeed, it makes the workflow more portable in cases where the files come from different origins.
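
    For example, a sketch (the link names and the tool invocation are illustrative):

    command <<<
      # link the bam and its index side by side in the working directory,
      # then point the tool at the local names
      ln -s ${input_bam} input.bam
      ln -s ${input_bam_index} input.bam.bai
      java -jar ... -I input.bam ...
    >>>

    Linking into the execution directory also leaves the localized inputs tree untouched, unlike moving the index around with mv.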
