How do I loop over a set of input files?

oskarv (Bergen), Member
edited October 2017 in Ask the WDL team

Let's say I have two input files that I want to loop over serially, i.e. not in parallel with a scatter/gather loop. How do I do that? I know while loops are supposed to be possible, but I haven't seen a code example, so I don't know the syntax, and perhaps they aren't fully supported yet. It should be possible to hack together a loop, though, e.g. by indexing the files and increasing the index by one on each iteration until all files have been processed. Or perhaps the scatter/gather function itself can be hacked?

The reason I want to loop rather than scatter/gather is to optimize my pipeline. At the moment I'm wasting resources, since I'm forced to use a suboptimal number of scatters due to hardware limitations. The task I want to serialize can utilize all threads, but I can't scatter it because of RAM constraints, while other tasks downstream use more CPU and less RAM. And since I'm forced to restrict the number of shards globally, every tool suffers.
If I could choose which tasks in a scatter/gather block are serialized, and how many shards each scattered task is allowed to create, I could utilize my resources better. The current one-size-(does not)-fit-all approach isn't quite doing it, in my opinion.

Whether you implement such functionality or not is secondary right now, though. What I'm really asking is: is it possible to loop over a set of input files?

Edit: One idea would be to start a subworkflow and restrict the maximum number of concurrent workflows to two. That would effectively loop over the files, since only one subworkflow can start at a time while the parent workflow is already running. But I'd like to keep this as clean as possible and not use more scripts than necessary, so I'd prefer to hack scatter/gather or use a while loop instead.
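
For reference, if I went the subworkflow route, I believe the cap on simultaneously running workflows lives in Cromwell's application.conf. Something along these lines should do it (the key name is taken from the Cromwell docs as I understand them, so treat it as an assumption):

    include required(classpath("application"))

    system {
        # Assumed setting: at most two workflows run at once, i.e. the parent
        # workflow plus a single subworkflow, which serializes the subworkflow calls.
        max-concurrent-workflows = 2
    }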

Best Answers

  • oskarv (Bergen), Member
    Accepted Answer

    I have now created a for loop that generates incrementally numbered output names, for cases where the input file names are identical across all input files, such as the output of a scatter/gather operation.

    task TaskName {
      Array[File] Inputs

      command {
        # Number each output so the names stay unique even when all inputs
        # share the same base name (e.g. shards from a scatter/gather).
        i=0
        for file in ${sep=' ' Inputs}; do
          i=$((i+1))
          toolName \
            --output FileName-$i.txt \
            --input $file
        done
      }

      output {
        Array[File] Output = glob("FileName-*.txt")
      }
    }

    This will create output files named FileName-1.txt, FileName-2.txt, FileName-3.txt etc.

    The call is as simple as:
    workflow WorkFlowName {
      Array[File] InputArrayFromJson

      call TaskName {
        input:
          Inputs = InputArrayFromJson
          # Or, if the input comes from e.g. a scatter/gather operation:
          # Inputs = PreviousToolName.Outputs
      }
    }
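
    And a minimal inputs JSON for the first variant might look like this (the file paths are placeholders):

    {
      "WorkFlowName.InputArrayFromJson": [
        "/path/to/first_input.txt",
        "/path/to/second_input.txt"
      ]
    }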

Answers

  • oskarv (Bergen), Member
    edited October 2017

    @ChrisL said:
    @oskarv unfortunately there's no while loop, and no easy way to make scatter happen in serial. As you mention, you can add a concurrent-job-limit for your backend, but then that applies to all jobs in the workflow (and indeed across all of your workflows) - a very blunt instrument.

    According to the documentation, while loops are coming though? https://github.com/openwdl/wdl/blob/develop/SPEC.md#loops

    @ChrisL said:
    In case you're curious why we never really added very much control over scalability on the Local backend, the real answer here would be to spin your jobs out to an HPC cluster like SGE or LSF, or to a cloud compute environment like Google's pipelines API, to allow you to scale to any size without ever worrying about resource limits!

    We work on single servers, so clusters or Google Cloud unfortunately aren't an option.

    @ChrisL said:
    If that's impossible for you, you could probably do what you want manually by expanding the loop longhand - if it's a known number of inputs (in your case 2?), e.g. something like:

    workflow foo {
      File f1
      File f2
      call t as t1 { input: f = f1, ready = true }
      call t as t2 { input: f = f2, ready = t1.done }
    }
    
    task t {
      Boolean ready
      File f
      command { ... }
      output {
        ...
        # As well as the normal outputs, use a boolean to indicate that we're done:
        Boolean done = true
      }
    }
    

    This is doing the trick for smaller runs and I'll probably use it for those, but it's not a very useful solution once you need to run more than four files or so. Perhaps I should just run a bash script to loop through the files, and then use the output files as inputs to the WDL script... That seems like the best option at the moment. I still think a proper for loop in WDL is sorely needed, though.
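
    A rough sketch of what I mean, i.e. doing the loop outside of WDL and then pointing the workflow's inputs JSON at the results (toolName and the paths are placeholders):

    #!/bin/bash
    # Placeholder pre-processing loop: run toolName serially over the inputs,
    # then list the resulting FileName-*.txt files in the workflow's inputs JSON.
    i=0
    for file in /path/to/input1.txt /path/to/input2.txt; do
        i=$((i+1))
        toolName --output FileName-$i.txt --input "$file"
    done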

    I've written it here before, but I'll mention it again anyway: one solution would be to let the user limit the number of jobs spawned by a given task inside a scatter/gather operation. If you limit it to one job, it is effectively a for loop, even though strictly speaking it's still a scatter operation. Either way, allowing the user to locally limit the number of shards per task in a scatter operation would be a useful feature in my opinion. It would override the global limit that is set in the application.conf file.
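
    For concreteness, the global limit I mean is the backend's concurrent-job-limit in application.conf, which caps jobs across all tasks and workflows at once; roughly like this (Local backend assumed):

    backend {
      providers {
        Local {
          config {
            # Assumed setting: no more than four jobs run concurrently on this
            # backend, regardless of which task or workflow spawned them.
            concurrent-job-limit = 4
          }
        }
      }
    }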

  • oskarv (Bergen), Member

    @mmah said:
    Here are the non-WDL solutions I have used when running on a SLURM cluster, where job startup cost can be significant relative to the job runtime cost. In cases where there are many small jobs, it is more efficient not to scatter. On my cluster, the age of jobs in the queue is a major factor in the job priority, so there is a big incentive not to globally restrict the number of concurrent jobs.

    bash loop
    ....

    Very nice! I'm going to have to take a closer look at this, thank you!

