Trouble with an intermediate gather step

Hi,

I'm working to migrate a pipeline from Snakemake to WDL so it can run on FireCloud. Unfortunately, I'm having difficulty migrating a specific bit of syntax, and I was hoping for some advice.

The workflow in question does the following:
1) First, it runs a processing script on each of a set of VCFs.
2) Next, it computes a set of summary statistics from the aggregated processed VCFs.
3) Finally, it uses the summary statistics to apply a filter to each of the processed VCFs.

In Snakemake, this looks as follows:

with open('samples.list') as slist:
    SAMPLES = [s.strip() for s in slist.readlines()]

rule all:
    input:
        expand('filtered_vcfs/{sample}.vcf', sample=SAMPLES)

rule process_vcf:
    input:
        'raw_vcfs/{sample}.vcf'
    output:
        'processed_vcfs/{sample}.vcf'
    shell:
        "./scripts/process_vcf.py {input} {output}"

rule compute_summary_stats:
    input:
        expand('processed_vcfs/{sample}.vcf', sample=SAMPLES)
    output:
        'stats/summary.txt'
    script:
        "scripts/compute_summary_stats.py"

rule filter_vcf:
    input:
        vcf='processed_vcfs/{sample}.vcf',
        stats='stats/summary.txt'
    output:
        'filtered_vcfs/{sample}.vcf'
    shell:
        "./scripts/filter_vcf.py {input.vcf} {input.stats} {output}"

This translates to a rule dependency graph in which process_vcf feeds both compute_summary_stats and filter_vcf, and filter_vcf additionally takes the summary stats as input.

However, I'm having difficulty translating this into WDL. I've set up a workflow that takes a tab delimited file of sample IDs and VCF filepaths as input. I can run a scatter over the resulting array, then call a task that gathers the results to compute the aggregated summary stats.

workflow preprocess {
  File vcflist

  Array[Array[String]] vcfs = read_tsv(vcflist)

  scatter (vcf in vcfs) {
    call process_vcf {
      input:
        sample=vcf[0],
        raw_vcf=vcf[1],
    }
  }

  call compute_summary_stats {
    input:
      vcfs=process_vcf.processed_vcf,
  }
}

I'm stuck trying to implement the filter step. The task needs to take both the filepath to the processed VCF and the sample ID as input, e.g.

task filter_vcf {
  File vcf
  File stats
  String sample

  command {
    ./scripts/filter_vcf.py ${vcf} ${stats} ${sample}.vcf
  }

  output {
    File filtered_vcf="${sample}.vcf"
  }
}

Is there a clean way to do this? I can run a scatter over the output of process_vcf, but that loses the associated sample IDs. Does gathering the results of a scatter preserve the input ordering? I was thinking that if I could select the first entry of every array in vcfs, I could zip that with process_vcf.processed_vcf, but I couldn't find the syntax for that in the WDL spec (the equivalent of Python's [x[0] for x in xs]).
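For reference, here's what I'm trying to express, written as plain Python (the sample names and paths are made up just to illustrate):

```python
# The rows of the input TSV: [sample ID, raw VCF path].
# These names are hypothetical, for illustration only.
vcfs = [
    ["sampleA", "raw_vcfs/sampleA.vcf"],
    ["sampleB", "raw_vcfs/sampleB.vcf"],
]

# Select the first entry of every row -- the list comprehension
# I couldn't find a WDL equivalent for.
samples = [row[0] for row in vcfs]

# If the gathered scatter output preserves input order, zipping the
# processed VCFs with the sample IDs recovers the pairing I need.
processed = ["processed_vcfs/%s.vcf" % s for s in samples]
pairs = list(zip(processed, samples))
```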

Any help would be much appreciated!

Answers

  • orodeh · Mountain View, CA · Member

    One way to do this is to also return the sample from the process_vcf task. Then, you can add a zip and scatter to the workflow. Something like:

    workflow preprocess {
      File vcflist

      Array[Array[String]] vcfs = read_tsv(vcflist)

      scatter (vcf in vcfs) {
        call process_vcf {
          input:
            sample=vcf[0],
            raw_vcf=vcf[1],
        }
      }

      call compute_summary_stats {
        input:
          vcfs=process_vcf.processed_vcf,
      }

      scatter (p in zip(process_vcf.processed_vcf, process_vcf.sample)) {
        call filter_vcf {
          input:
            vcf=p.first,
            stats=compute_summary_stats.stats,
            sample=p.second
        }
      }
    }
    
  • msto · MGH · Member
    edited January 30

    Thanks for the suggestion! I tried implementing that but received the following error:

    Unable to build WOM node for Scatter '$scatter_1': Unable to build WOM node for WdlTaskCall 'filter_vcf': key not found: p.first
    Unable to build WOM node for Scatter '$scatter_1': Unable to build WOM node for WdlTaskCall 'filter_vcf': key not found: p.second
    

    I tried indexing into the pair instead, but received the following error:

    Unable to build WOM node for Scatter '$scatter_1': Unable to build WOM node for WdlTaskCall 'filter_vcf': Invalid indexing target. You cannot index a value of type 'Pair[File, String]'
    Unable to build WOM node for Scatter '$scatter_1': Unable to build WOM node for WdlTaskCall 'filter_vcf': Invalid indexing target. You cannot index a value of type 'Pair[File, String]'
    

    Any suggestions? Thanks!

    (Note: I'm running Cromwell 30.2)

    EDIT: looks like p.left and p.right are the correct syntax. It worked, thank you for the suggestion!
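
    For posterity, here's the scatter that worked for me: orodeh's suggestion, with the pair accessors renamed to p.left and p.right (a sketch of what I ran on my Cromwell 30.2 setup; it assumes process_vcf also outputs the sample ID, as suggested above):

    ```wdl
    scatter (p in zip(process_vcf.processed_vcf, process_vcf.sample)) {
      call filter_vcf {
        input:
          vcf=p.left,                          # the processed VCF from the pair
          stats=compute_summary_stats.stats,   # shared summary stats file
          sample=p.right                       # the sample ID carried alongside
      }
    }
    ```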

  • ChrisL · Cambridge, MA · Member, Broadie, Moderator, Dev

    This is probably a little too hard to find in the spec, but pair access is via p.left and p.right.

  • orodeh · Mountain View, CA · Member

    Sorry, that was my bad.
