Trouble with an intermediate gather step

Hi,

I'm working to migrate a pipeline from Snakemake to WDL so it can run on Firecloud. Unfortunately, I'm having difficulty migrating a specific bit of syntax and I was hoping for some advice.

The workflow in question does the following:
1) First, it runs a processing script on each of a set of VCFs.
2) Next, it computes a set of summary statistics from the aggregated processed VCFs.
3) Finally, it uses the summary statistics to apply a filter to each of the processed VCFs.

In Snakemake, this looks as follows:

with open('samples.list') as slist:
    SAMPLES = [s.strip() for s in slist.readlines()]

rule all:
    input:
        expand('filtered_vcfs/{sample}.vcf', sample=SAMPLES)

rule process_vcf:
    input:
        'raw_vcfs/{sample}.vcf'
    output:
        'processed_vcfs/{sample}.vcf'
    shell:
        "./scripts/process_vcf.py {input} {output}"

rule compute_summary_stats:
    input:
        expand('processed_vcfs/{sample}.vcf', sample=SAMPLES)
    output:
        'stats/summary.txt'
    script:
        "scripts/compute_summary_stats.py"

rule filter_vcf:
    input:
        vcf='processed_vcfs/{sample}.vcf',
        stats='stats/summary.txt'
    output:
        'filtered_vcfs/{sample}.vcf'
    shell:
        "./scripts/filter_vcf.py {input.vcf} {input.stats} {output}"

Which translates to the below graph of rule dependencies

However, I'm having difficulty translating this into WDL. I've set up a workflow that takes a tab delimited file of sample IDs and VCF filepaths as input. I can run a scatter over the resulting array, then call a task that gathers the results to compute the aggregated summary stats.

workflow preprocess {
  File vcflist

  Array[Array[String]] vcfs = read_tsv(vcflist)

  scatter (vcf in vcfs) {
    call process_vcf {
      input:
        sample=vcf[0],
        raw_vcf=vcf[1],
    }
  }

  call compute_summary_stats {
    input:
      vcfs=process_vcf.processed_vcf,
  }
}

I'm stuck trying to implement the filter step. The task needs to take both the filepath to the processed VCF and the sample ID as input, e.g.

task filter_vcf {
  File vcf
  File stats
  String sample

  command {
    ./scripts/filter_vcf.py ${vcf} ${stats} ${sample}.vcf
  }

  output {
    File filtered_vcf="${sample}.vcf"
  }
}

Is there a clean way to do this? I can run a scatter over the output of process_vcf, but it loses the associated sample IDs. Does gathering the results of a scatter preserve the input ordering? I was thinking if I could select the first entry of every array in vcfs I could zip that with process_vcf.processed_vcf, but I couldn't find the syntax for that in the WDL spec (the equivalent to Python's [x[0] for x in xs]).

Any help would be much appreciated!

Answers

  • orodehorodeh Mountain View, CAMember

    One way to do this is to also return the sample from the process_vcf task. Then, you can add a zip and scatter to the workflow. Something like:

    workflow preprocess {
      File vcflist
    
      Array[Array[String]] vcfs = read_tsv(vcflist)
    
      scatter (vcf in vcfs) {
        call process_vcf {
          input:
            sample=vcf[0],
            raw_vcf=vcf[1],
        }
      }
    
      call compute_summary_stats {
        input:
          vcfs=process_vcf.processed_vcf,
      }
    }
    
     scatter (p in zip(process_vcf.processed_vcf, process_vcf.sample)) {
        call filter_vcf { 
            input: 
                vcf=p.first,
                stats= compute_summary_stats.stats,
                sample=p.second
        }
    }
    
  • mstomsto MGHMember
    edited January 30

    Thanks for the suggestion! I tried implementing that but received the following error:

    Unable to build WOM node for Scatter '$scatter_1': Unable to build WOM node for WdlTaskCall 'filter_vcf': key not found: p.first
    Unable to build WOM node for Scatter '$scatter_1': Unable to build WOM node for WdlTaskCall 'filter_vcf': key not found: p.second
    

    I tried indexing into the pair instead, but received the following error:

    Unable to build WOM node for Scatter '$scatter_1': Unable to build WOM node for WdlTaskCall 'filter_vcf': Invalid indexing target. You cannot index a value of type 'Pair[File, String]'
    Unable to build WOM node for Scatter '$scatter_1': Unable to build WOM node for WdlTaskCall 'filter_vcf': Invalid indexing target. You cannot index a value of type 'Pair[File, String]'
    

    Any suggestions? Thanks!

    (Note: I'm running Cromwell 30.2)

    EDIT: looks like p.left and p.right are the correct syntax. It worked, thank you for the suggestion!

  • ChrisLChrisL Cambridge, MAMember, Broadie, Moderator, Dev ✭✭

    This is probably a little too hard to find in the spec, but pair access is via p.left and p.right

  • ChrisLChrisL Cambridge, MAMember, Broadie, Moderator, Dev ✭✭
  • orodehorodeh Mountain View, CAMember

    Sorry, that was my bad.

Sign In or Register to comment.