Use read_tsv to read in file names AND string parameters for use with scatter

dannykwells San Francisco Member
edited September 2017 in Ask the Cromwell + WDL Team

Hi folks,

We are putting WDL/Cromwell into production with a Google Cloud backend - so far, it's been great! My question is this:

Ideally, we would like to specify a single .tsv file containing both file names (say, input bams) and parameters (say, sample name), and scatter over its rows to run the samples in parallel. For example, the .tsv could look like this:

gs://<my-bucket>/tumor1.bam    gs://<my-bucket>/normal1.bam    gs://<my-bucket>/interval1.bed    SAMPLE_NAME1
gs://<my-bucket>/tumor2.bam    gs://<my-bucket>/normal2.bam    gs://<my-bucket>/interval2.bed    SAMPLE_NAME2

etc.

Now, we currently use a line like:

File inputSamplesFile
Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)

to build an array of files, and then scatter over these using

scatter (sample in inputSamples) {
}

Since we use "Array[Array[File]]", "inputSamplesFile" cannot include the column with the sample name in it (since SAMPLE_NAME is a string, not a file).

So, my question is, is there a way to read in a file with mixed types (Files and Strings), and then be able to scatter over the rows like we are doing? Any help would be great.

Thanks!
-d

Best Answer

Answers

  • dannykwells San Francisco Member

    Great, I will give this a shot and report back!

  • henderj8 Cincinnati Children's Hospital Member

    Hey @Ruchi,

    I am dealing with the same sort of issue that @dannykwells discussed above. I am trying to pass an array of samples that contains reads one and two of several fastq files and their associated sample ID. Unlike the above example, I am trying to pass each scattered row to a subworkflow, in which I am running an analysis pipeline that involves another scatter further down. I am getting a 'Variable not found' error for the inputs I define from the array.

    Here is a snippet of code I am getting an error on:

    Main WDL:

    import "bwasub.wdl" as sub
    workflow align {
        File inputfastq
        Array[Array[String]] samples = read_tsv(inputfastq)
    
        scatter (sample in samples) {
            call sub.bwasub {
                input:
                    FASTQ = samples[1],
                    FASTQ2 = samples[2],
                    sample_id = samples[0]
            }
        }
    }
    

    Sub WDL:

    task bwamem {
    String sample_id
    File FASTQ
    File FASTQ2
    String reference
    File ref_fasta
    File ref_fasta_index
    File ref_dict
    File ref_amb
    File ref_ann
    File ref_bwt
    File ref_pac
    File ref_sa
    
    command {
    bwa mem -M -t 8 -j -R '@RG\tID:A\tLB:testlib\tPU:FCB05VTABXX\tSM:${sample_id}\tPL:ILLUMINA' ${reference} ${FASTQ} ${FASTQ2} > ${sample_id}.sam
    }   
    
    
    runtime { ... } 
    
    output { ... }
    }
    
    
    
    workflow bwasub {
    String reference
    File ref_fasta
    File ref_fasta_index
    File ref_dict
    File ref_amb
    File ref_ann
    File ref_bwt
    File ref_pac
    File ref_sa
    
        call bwamem {
            input:
            FASTQ = samples[1],
            FASTQ2 = samples[2],
            sample_id = samples[0],
            ref_fasta = ref_fasta,
            ref_fasta_index = ref_fasta_index,
            reference = reference,
            ref_dict = ref_dict,
            ref_amb = ref_amb,
            ref_ann = ref_ann,
            ref_bwt = ref_bwt,
            ref_pac = ref_pac,
            ref_sa = ref_sa
        }
    }
    

    Error:
    wdl4s.wdl.exception.VariableNotFoundException$$anon$1: Variable 'FASTQ' not found
    Variable 'FASTQ2' not found
    Variable 'sample_id' not found

  • Ruchi Member, Broadie, Moderator, Dev

    Hey @henderj8

    I believe you just need to make a tiny change to the syntax inside the scatter block in the Main WDL: you'd want to index sample (the scatter variable, i.e. one row), not samples (the whole array), to get the Fastq/sample name details:

    samples[1] --> sample[1]
    samples[2] --> sample[2]
    samples[0] --> sample[0]

  • henderj8 Cincinnati Children's Hospital Member

    Unfortunately, I had it like that in the beginning and changed it to 'samples' to see if that was the issue, and it was not. I just retested with 'sample' and I am still getting the same error, which has me scratching my head. Could it be that the subworkflow is not correctly receiving/identifying the inputs assigned from the tsv file?

  • Ruchi Member, Broadie, Moderator, Dev

    My apologies -- I just noticed another thing.

    You're passing in task-level inputs (FASTQ, FASTQ2, sample_id), but the call is on the workflow bwasub, and that's not allowed. Instead, you can create workflow-level variables in bwasub and pass those to the task bwamem inside the sub itself:

    workflow bwasub {
    String reference
    File ref_fasta
    File ref_fasta_index
    File ref_dict
    File ref_amb
    File ref_ann
    File ref_bwt
    File ref_pac
    File ref_sa
    File FASTQ
    File FASTQ2
    String sample_id
    
        call bwamem {
            input:
            FASTQ = FASTQ,
            FASTQ2 = FASTQ2,
            sample_id = sample_id,
            ref_fasta = ref_fasta,
            ref_fasta_index = ref_fasta_index,
            reference = reference,
            ref_dict = ref_dict,
            ref_amb = ref_amb,
            ref_ann = ref_ann,
            ref_bwt = ref_bwt,
            ref_pac = ref_pac,
            ref_sa = ref_sa
        }
    }
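
    Putting both fixes together, the Main WDL might look roughly like the sketch below. It assumes the TSV columns are ordered sample ID, read 1, read 2 (as the indices in the original snippet imply) and that bwasub now declares FASTQ, FASTQ2 and sample_id at the workflow level, as shown above. The reference inputs of bwasub are left out here and still need to be supplied.

    ```wdl
    import "bwasub.wdl" as sub

    workflow align {
        File inputfastq
        # Assumed row layout: sample_id <TAB> read 1 fastq <TAB> read 2 fastq
        Array[Array[String]] samples = read_tsv(inputfastq)

        # 'sample' is a single row (Array[String]); 'samples' is the whole table
        scatter (sample in samples) {
            call sub.bwasub {
                input:
                    sample_id = sample[0],
                    # These are Strings here; they are coerced to File because
                    # bwasub declares FASTQ and FASTQ2 as File
                    FASTQ = sample[1],
                    FASTQ2 = sample[2]
            }
        }
    }
    ```

    Reading every column as a String and letting the File-typed inputs do the conversion is also what allows file paths and plain string parameters to share one .tsv.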
    
  • henderj8 Cincinnati Children's Hospital Member

    Ah, I see that fixed my problem! Thank you very much for your help!
