Currently the "Log in with Google" button redirects you to a "Page not found." This is an issue that our forum vendors are working on fixing. In the meantime, while on the "Page not found" you can edit the URL to delete the second gatk, firecloud, or wdl (depending on what subforum you are acessing).
ex: https://gatkforums.broadinstitute.org/gatk/gatk/entry/...
Trouble with an intermediate gather step

Hi,
I'm working to migrate a pipeline from Snakemake to WDL so that it can run on FireCloud. Unfortunately, I'm having difficulty migrating a specific bit of syntax, and I was hoping for some advice.
The workflow in question does the following:
1) First, it runs a processing script on each of a set of VCFs.
2) Next, it computes a set of summary statistics from the aggregated processed VCFs.
3) Finally, it uses the summary statistics to apply a filter to each of the processed VCFs.
In Snakemake, this looks as follows:
with open('samples.list') as slist:
    SAMPLES = [s.strip() for s in slist.readlines()]

rule all:
    input:
        expand('filtered_vcfs/{sample}.vcf', sample=SAMPLES)

rule process_vcf:
    input:
        'raw_vcfs/{sample}.vcf'
    output:
        'processed_vcfs/{sample}.vcf'
    shell:
        "./scripts/process_vcf.py {input} {output}"

rule compute_summary_stats:
    input:
        expand('processed_vcfs/{sample}.vcf', sample=SAMPLES)
    output:
        'stats/summary.txt'
    script:
        "scripts/compute_summary_stats.py"

rule filter_vcf:
    input:
        vcf='processed_vcfs/{sample}.vcf',
        stats='stats/summary.txt'
    output:
        'filtered_vcfs/{sample}.vcf'
    shell:
        "./scripts/filter_vcf.py {input.vcf} {input.stats} {output}"
This translates to the following graph of rule dependencies: process_vcf feeds into both compute_summary_stats and filter_vcf, and compute_summary_stats also feeds into filter_vcf.
However, I'm having difficulty translating this into WDL. I've set up a workflow that takes a tab-delimited file of sample IDs and VCF file paths as input. I can run a scatter over the resulting array, then call a task that gathers the results to compute the aggregated summary stats:
workflow preprocess {
  File vcflist
  Array[Array[String]] vcfs = read_tsv(vcflist)

  scatter (vcf in vcfs) {
    call process_vcf {
      input:
        sample=vcf[0],
        raw_vcf=vcf[1]
    }
  }

  call compute_summary_stats {
    input:
      vcfs=process_vcf.processed_vcf
  }
}
I'm stuck trying to implement the filter step. The task needs to take both the filepath to the processed VCF and the sample ID as input, e.g.
task filter_vcf {
  File vcf
  File stats
  String sample

  command {
    ./scripts/filter_vcf.py ${vcf} ${stats} ${sample}.vcf
  }
  output {
    File filtered_vcf = "${sample}.vcf"
  }
}
Is there a clean way to do this? I can run a scatter over the output of process_vcf, but it loses the associated sample IDs. Does gathering the results of a scatter preserve the input ordering? I was thinking that if I could select the first entry of every array in vcfs, I could zip that with process_vcf.processed_vcf, but I couldn't find the syntax for that in the WDL spec (the equivalent of Python's [x[0] for x in xs]).
Any help would be much appreciated!
Answers
One way to do this is to also return the sample from the process_vcf task. Then, you can add a zip and a scatter to the workflow. Something like:
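Roughly the following, as a sketch: here I'm assuming process_vcf is shaped like your filter_vcf task, and the sample_out and summary_stats output names are illustrative, not from your workflow:

task process_vcf {
  String sample
  File raw_vcf

  command {
    ./scripts/process_vcf.py ${raw_vcf} ${sample}.vcf
  }
  output {
    File processed_vcf = "${sample}.vcf"
    # pass the sample ID through so it can be re-paired after the gather
    String sample_out = sample
  }
}

Then, in the workflow, zip the gathered sample IDs with the gathered VCFs and scatter over the resulting pairs:

scatter (p in zip(process_vcf.sample_out, process_vcf.processed_vcf)) {
  call filter_vcf {
    input:
      sample=p.left,
      vcf=p.right,
      stats=compute_summary_stats.summary_stats
  }
}

Gathered scatter outputs preserve the order of the scattered input, so the two arrays line up element by element.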
Thanks for the suggestion! I tried implementing that but received the following error:
I tried indexing into the pair instead, but received the following error:
Any suggestions? Thanks!
(Note: I'm running Cromwell 30.2)
EDIT: looks like p.left and p.right are the correct syntax. It worked, thank you for the suggestion!

This is probably a little too hard to find in the spec, but pair access is via p.left and p.right.
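For instance (a standalone illustration, not code from this thread):

Pair[String, Int] p = ("sampleA", 23)
String s = p.left    # "sampleA"
Int n = p.right      # 23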
Here's the Pair indexing spec for future reference: https://github.com/openwdl/wdl/blob/master/versions/draft-3/SPEC.md#pair-indexing

Sorry, that was my bad.