How to identify different members of scatter output array?

I'm trying to run further commands on the output array of a scatter block. How can I pass each file of the array separately as input to the next task? All I could find is how to gather the array output into a single input, using different forms of the "sep" syntax.

Answers

  • The easiest way is to add those calls inside the scatter block as well, e.g.:

    workflow wf {
        scatter (sample in samples) {
            call trim {input: sample=sample}
            call map {input: sample=trim.sample}
        }
    }
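
    If a later step has to run outside the scatter block, the scattered outputs are gathered into an array that can be indexed or scattered over again. The following is a minimal sketch in the same draft-2 style, not taken from the original answer: the workflow name gatherExample, the task bodies (plain shell stand-ins) and the output names trimmed and mapped are all assumptions for illustration.

    workflow gatherExample {
        Array[String] samples

        scatter (sample in samples) {
            call trim { input: sample = sample }
        }

        # Outside the scatter block, trim.trimmed is an Array[File]
        # with one element per sample, in the order of the scatter input.
        Array[File] allTrimmed = trim.trimmed

        # Pick a single member by index...
        call map as mapFirst {
            input: sample = samples[0],
                   reads  = allTrimmed[0]
        }

        # ...or walk over all members again, one call per file.
        scatter (i in range(length(allTrimmed))) {
            call map {
                input: sample = samples[i],
                       reads  = allTrimmed[i]
            }
        }
    }

    # Hypothetical stand-in for a real trimming tool.
    task trim {
        String sample

        command {
            echo "trimmed ${sample}" > ${sample}.trimmed.txt
        }
        output {
            File trimmed = "${sample}.trimmed.txt"
        }
    }

    # Hypothetical stand-in for a real mapping tool.
    task map {
        String sample
        File reads

        command {
            cat ${reads} > ${sample}.mapped.txt
        }
        output {
            File mapped = "${sample}.mapped.txt"
        }
    }

    Because the gathered array keeps the order of the scatter input, samples[i] and allTrimmed[i] refer to the same sample.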
    
  • alphahmed JAPAN Member
    edited May 2017

    Thank you Redmar for the answer, but the workflow I am running is a bit more complicated. Could you please help me with it? The first task is already a pipeline of multiple steps (fastqToSam, then markIlluminaAdapter) whose output I need to use in the next task (samToFastq, BWA, then MergeBamAlignment). The sample number is a very important identifier for the different files.

    I'm assigning the Sample_number from the samples_input txt file and then using it in the inputs and outputs of the Picard and GATK tools. Therefore, the outputs have ${Sample_number} as part of the output file name.

    When I try to run the next task inside the same scatter block, as you suggested, I need to define its input when I call it. I can still pass Sample_number as an input, but I can't use ${Sample_number} as part of the input expression (e.g. sample=trim.trimmed_${Sample_number}).

  • Redmar_van_den_Berg Member ✭✭
    Accepted Answer

    It seems you have not yet embraced the full power of the dark side of the FORCE (Cromwell). It is Cromwell's job to keep track of the different files, so that you as a user don't have to mess about with sample numbers. This is an excerpt of the workflow I use to trim and map reads:

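    workflow trimAndMap {
        File inputFastqFile
        Int nrCores
    
        # Each line of the TSV becomes one inner array: [samplename, forward, reverse]
        Array[Array[File]] inputFastq = read_tsv(inputFastqFile)
    
        # One trimmomatic call and one map call per sample
        scatter (sample in inputFastq) {
            call trimmomatic {
                input:  samplename  = sample[0],
                        forward     = sample[1],
                        reverse     = sample[2],
                        nrCores     = nrCores
            }
            # Inside the scatter, trimmomatic.forward and trimmomatic.reverse
            # are the outputs for this sample only
            call map {
                input:  forward     = trimmomatic.forward,
                        reverse     = trimmomatic.reverse,
                        samplename  = sample[0],
                        nrCores     = nrCores
            }
        }
    }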
    

    This way, Cromwell makes sure that the name samplename1 stays associated with the correct forward and reverse files, even after trimming. I then use the samplename in the output filename for the mapping, but only so that I can easily recognize the files after the analysis has run. I never use that filename within Cromwell or WDL to identify which files belong to which sample; Cromwell does that for me automatically.

    The inputFastqFile is a tab-separated file with one sample per line (read_tsv splits each line on tabs) and looks like this:

    samplename1 /path/to/forward.fastq.gz   /path/to/reverse.fastq.gz
    sample2 /path/to/forward2.fastq.gz  /path/to/reverse2.fastq.gz
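
    To make the point about filenames concrete, here is a minimal, self-contained sketch in the same draft-2 style. It is not the workflow from this answer: the workflow name namedOutputs, the tasks trim and countLines, their shell commands (simple stand-ins for real tools) and the output names trimmed and counts are all assumptions for illustration. The sample name appears in the file name written by the command, but the next call refers to the file through the task's declared output, trim.trimmed, and never through the file name itself.

    workflow namedOutputs {
        File samplesFile

        # Tab-separated, one line per sample: samplename <TAB> /path/to/reads.fastq
        Array[Array[String]] samples = read_tsv(samplesFile)

        scatter (sample in samples) {
            call trim {
                input:  samplename  = sample[0],
                        reads       = sample[1]
            }
            # trim.trimmed is this sample's output file, whatever it is called on disk
            call countLines {
                input:  samplename  = sample[0],
                        trimmed     = trim.trimmed
            }
        }
    }

    # Hypothetical stand-in for a trimming step: keeps the first 4000 lines.
    task trim {
        String samplename
        File reads

        command {
            head -n 4000 ${reads} > ${samplename}.trimmed.fastq
        }
        output {
            # The sample name is part of the file name, but callers only see "trimmed"
            File trimmed = "${samplename}.trimmed.fastq"
        }
    }

    # Hypothetical stand-in for a downstream step: counts lines in the trimmed file.
    task countLines {
        String samplename
        File trimmed

        command {
            wc -l < ${trimmed} > ${samplename}.linecount.txt
        }
        output {
            File counts = "${samplename}.linecount.txt"
        }
    }

    An expression such as trim.trimmed_${Sample_number} is therefore not needed: the output name is fixed in the task, and the scatter keeps each call's output paired with the sample it was called for.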
    
  • alphahmed JAPAN Member
    edited May 2017

    Redmar! Thanks a lot. I've just had an awesome taste of Cromwell's power. I never thought that was possible.

    I appreciate your help. Can you point me to where I can learn more about the dark side of the FORCE (Cromwell)?

  • Ruchi Member, Broadie, Moderator, Dev admin

    I would definitely start with some of the WDL tutorials if you haven't explored them yet. Other resources worth exploring:

    1. The WDL spec. Although it's a bit long, it shows which features are supported by the most current version of Cromwell.
    2. An example WDL that utilizes many of the GATK tools you reference.

    Are there specific Cromwell options you're interested in learning about?

  • alphahmed JAPAN Member
    edited May 2017

    Thank you @Ruchi !
    I've completed the tutorials; they were really informative. The example WDL is definitely a comprehensive way of learning about the different behaviors of Cromwell.

    I'm working on developing pipelines for servers that deal with multiple samples. I'm also trying to implement some loops within the analyses, which might make things a bit complicated.

    I need to learn as much as possible about Cromwell so I can do more with parallel computing and make the pipelines more efficient.

    Thanks again! I appreciate your kind help.
