Update: July 26, 2019
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on GitHub.

Secondary index files and directories in WDL

WDL folks;
This is a follow-up to a recent discussion about getting compatible bcbio-generated WDL (http://gatkforums.broadinstitute.org/wdl/discussion/9257/object-attribute-access-and-secondary-index-files). Thanks to all the great help you've provided, we now have compatible WDL output that passes validation:

https://github.com/bcbio/test_bcbio_cwl/blob/master/run_info-cwl-wdl

This is brilliant, and I'd like to move into testing runs with Cromwell. Before starting this, there is one major area I know we're missing in the conversion: handling of secondary files and directories of files. CWL has the notion of secondaryFiles (http://www.commonwl.org/v1.0/Workflow.html#File), which you can use to group these files with their primary file and ensure they get staged/run next to each other. I use this in bcbio and wanted to figure out the best way to map it into WDL.

There are two cases we use these for:

  • Index files associated with compressed inputs, like BAM bai indices and bgzip VCF tbi indices. These are a single index file attached to the original file that should get staged in the same directory when running.
  • Directories of index files like bwa or snpeff. These are a bit trickier since they can have many files and a variable number depending on the input.

What is the recommended way to deal with these cases in WDL? I'll have to re-engineer bcbio to be able to represent and pass these and wanted to do so in a way that was forward compatible with WDL's thoughts and plans. I've seen recommendations on current hacks like explicitly declaring the indexes as separate files, or tarring up a directory of files and passing that as input. I'm not clear enough on staging files from WDL/Cromwell to understand if these are guaranteed to always go in the right place (bai next to bam, all indexes in the same directory).
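As a rough illustration of the two workarounds mentioned above, here is a minimal sketch in the draft-2 WDL syntax used elsewhere in this thread. The task names, the `indexPrefix` input, and the tool invocations are hypothetical and chosen only for illustration; nothing here is an official recommendation, and the first task makes no assumption about where the engine places each file.

```wdl
# Sketch only: task names, inputs, and tool invocations are illustrative.

# Case 1: a single index file declared as its own input.
task showIndexLocation {
  File bam
  File bamIndex  # declared explicitly so the engine localizes the index too

  command {
    # Print where each file was localized; the two directories may or may not match.
    dirname ${bam}
    dirname ${bamIndex}
  }
  output {
    Array[String] dirs = read_lines(stdout())
  }
}

# Case 2: a directory of index files passed around as one tarball.
task alignWithTarredIndex {
  File bwaIndexTar    # tarball containing all bwa index files
  String indexPrefix  # hypothetical: base name of the index inside the tarball
  File reads

  command {
    # Unpack into a known directory so every index file ends up side by side.
    mkdir bwa_index
    tar -xf ${bwaIndexTar} -C bwa_index
    bwa mem bwa_index/${indexPrefix} ${reads} > aligned.sam
  }
  output {
    File sam = "aligned.sam"
  }
}
```

The tarball approach sidesteps the staging question because the task itself controls the directory layout; the explicit-index approach only guarantees that the index is localized, not where it lands.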

Thanks for any thoughts/suggestions/tips.

Related GitHub issue: #1996 (state: closed; filed by Geraldine_VdAuwera; closed by katevoss)

Best Answer

Answers

  • chapmanb (Boston, MA; Member)

    Kate;
    Thanks so much for the detailed answer, this is a big help in terms of planning how to modify the workflow. Do you have rough timelines on adding explicit support for secondary files? I'm mostly trying to decide if it would be more practical to re-architect how we represent these or be lazy and wait for y'all to have a cleaner path to supplying these. Thanks again for the helpful discussion.

  • KateN (Cambridge, MA; Member, Broadie, Moderator, admin)
    At this point we do not have an ETA for secondary-file support, so you would be better off implementing the workflow now and editing it later when we are able to provide this feature.
  • Migwell (Member)

    I'm glad there's a workaround for this. However, is it true that this only works if the File is declared by itself, and not as part of a compound object? Because I've been trying to pass around BAMs and their indices as a Pair (e.g. Pair[File, File] tumourBam), but it doesn't look like this forces the file to be localised.

  • Ruchi (Member, Broadie, Moderator, Dev, admin)

    Hey @Migwell

    I have a simple workflow below where my only input to a task is a pair of files:

    workflow pairs {
      File bam
      File bamIndex
      Pair[File, File] myFiles = (bam, bamIndex)
    
      call catFiles {input: files = myFiles }   
    }
    
    
    task catFiles {
      Pair[File, File] files 
    
      command {
       cat ${files.left}
       cat ${files.right}
      }
    }
    

    When I check the inputs directory for call-catFiles, I see both the bam and bamIndex files have been localized. Are you experiencing trouble with a task localizing a pair of files? Would you mind sharing your WDL source file? Thanks!

  • oneillkza (Member)

    Hmmm ... so this doesn't seem to work for subworkflows.

    Structure of an execution directory for HaplotypeCaller when called standalone:

    $ find
    .
    ./execution
    ./execution/script
    ./execution/script.submit
    ./execution/stdout.submit
    ./execution/stderr.submit
    ./execution/stdout
    ./execution/stderr
    ./execution/K000032_1_lane_dupsFlagged_sm_tagged.g.vcf.gz
    ./execution/K000032_1_lane_dupsFlagged_sm_tagged.g.vcf.gz.tbi
    ./execution/rc
    ./inputs
    ./inputs/691592025
    ./inputs/691592025/GRCh37-lite.fa.fai
    ./inputs/691592025/GRCh37-lite.fa
    ./inputs/691592025/GRCh37-lite.dict
    ./inputs/691592025/K000032_1_lane_dupsFlagged_sm_tagged.bam.bai
    ./inputs/691592025/K000032_1_lane_dupsFlagged_sm_tagged.bam
    ./inputs/648028064
    ./inputs/648028064/scattered.interval_list
    ./tmp.72c8af67
    

    Structure for a HaplotypeCaller execution directory when called as part of a subworkflow:

    $ find
    .
    ./execution
    ./execution/script
    ./execution/script.submit
    ./execution/stdout.submit
    ./execution/stderr.submit
    ./execution/stdout
    ./execution/stderr
    ./execution/NA12878_5X_downsampled.g.vcf.gz
    ./execution/NA12878_5X_downsampled.g.vcf.gz.tbi
    ./execution/rc
    ./inputs
    ./inputs/-1371904032
    ./inputs/-1371904032/GRCh37-lite.dict
    ./inputs/-1371904032/GRCh37-lite.fa
    ./inputs/-1371904032/GRCh37-lite.fa.fai
    ./inputs/-216796705
    ./inputs/-216796705/NA12878_5X_downsampled.bam.bai
    ./inputs/-1113293895
    ./inputs/-1113293895/scattered.interval_list
    ./inputs/-1204402517
    ./inputs/-1204402517/NA12878_5X_downsampled.bam
    

    Actually, I'm going to make a separate thread for this.

  • oneillkza (Member)

    It seems like this might be a result of passing the product of a scatter down to the task. Here's the same task run as a subworkflow but without the scatter:

    $ find
    .
    ./execution
    ./execution/script
    ./execution/script.submit
    ./execution/stdout.submit
    ./execution/stderr.submit
    ./execution/stdout
    ./execution/stderr
    ./execution/NA12878_10X_downsampled.g.vcf.gz
    ./inputs
    ./inputs/-1113293895
    ./inputs/-1113293895/scattered.interval_list
    ./inputs/-1371904032
    ./inputs/-1371904032/GRCh37-lite.dict
    ./inputs/-1371904032/GRCh37-lite.fa.fai
    ./inputs/-1371904032/GRCh37-lite.fa
    ./inputs/-1931584957
    ./inputs/-1931584957/NA12878_10X_downsampled.bam
    ./inputs/-1931584957/NA12878_10X_downsampled.bam.bai
    ./tmp.8c08232a
    

    Now the .bam and .bai are co-localised, but a lot of the inputs are still split up. It's not clear to me how Cromwell decides which inputs to group and which not to.
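
    One way to avoid depending on how the inputs directories are grouped is to link the files into the task's working directory under names the tool expects. The following is a minimal sketch, assuming the index is declared as a separate File input; the task name and the samtools call are hypothetical stand-ins for whatever tool needs the index next to the BAM.

    ```wdl
    # Sketch only: task name and tool invocation are illustrative.
    task linkedBamStats {
      File bam
      File bamIndex

      command {
        # Symlink both inputs into the working directory so the index sits
        # next to the BAM no matter which inputs/ subdirectory each came from.
        ln -s ${bam} input.bam
        ln -s ${bamIndex} input.bam.bai
        samtools idxstats input.bam > idxstats.txt
      }
      output {
        File stats = "idxstats.txt"
      }
    }
    ```

    With this layout the scattered and non-scattered cases behave the same from the tool's point of view, since it only ever sees the linked names.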
