To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

How to convert multiple bwa mem tasks into a single scatter-gather code block

YatrosYatros Seattle, WA, USAMember
edited September 2017 in Ask the WDL team

Given the following WDL script:

workflow Mapping {

  String bin_path
  String wdir
  String sample_name
  String threads
  File human_g1k_v37_decoy_fa
  File human_g1k_v37_decoy_fai
  File human_g1k_v37_decoy_bwt
  File human_g1k_v37_decoy_pac
  File human_g1k_v37_decoy_ann
  File human_g1k_v37_decoy_amb
  File human_g1k_v37_decoy_sa

  call getnames {
    input: wdir=wdir
    }

  call bwa1 {
    input: sample_name=sample_name, threads=threads, human_g1k_v37_decoy_fa=human_g1k_v37_decoy_fa, human_g1k_v37_decoy_fai=human_g1k_v37_decoy_fai, human_g1k_v37_decoy_bwt=human_g1k_v37_decoy_bwt, human_g1k_v37_decoy_pac=human_g1k_v37_decoy_pac, human_g1k_v37_decoy_ann=human_g1k_v37_decoy_ann, human_g1k_v37_decoy_amb=human_g1k_v37_decoy_amb, human_g1k_v37_decoy_sa=human_g1k_v37_decoy_sa
    }

  call bwa2 {
    input: sample_name=sample_name, threads=threads, human_g1k_v37_decoy_fa=human_g1k_v37_decoy_fa, human_g1k_v37_decoy_fai=human_g1k_v37_decoy_fai, human_g1k_v37_decoy_bwt=human_g1k_v37_decoy_bwt, human_g1k_v37_decoy_pac=human_g1k_v37_decoy_pac, human_g1k_v37_decoy_ann=human_g1k_v37_decoy_ann, human_g1k_v37_decoy_amb=human_g1k_v37_decoy_amb, human_g1k_v37_decoy_sa=human_g1k_v37_decoy_sa
    }

   call mergesamfiles {
    input: bin_path=bin_path, sample_name=sample_name, in1=bwa1.out, in2=bwa2.out
   }

  }

  task getnames {
      String wdir
      command {readlink -f ${wdir}/*.fastq.gz}
      output {File getnames = stdout ()}   
  }

  task bwa1 {
      String sample_name
      String threads
      File human_g1k_v37_decoy_fa
      File human_g1k_v37_decoy_fai
      File human_g1k_v37_decoy_bwt
      File human_g1k_v37_decoy_pac
      File human_g1k_v37_decoy_ann
      File human_g1k_v37_decoy_amb
      File human_g1k_v37_decoy_sa
      String lane
      String ID
      String SM
      String PL
      String LB
      String PU
      File R1
      File R2
      command {bwa mem -M -t ${threads} -R '@RG\tID:${ID}\tSM:${SM}\tPL:${PL}\tLB:${LB}\tPU:${PU}' ${human_g1k_v37_decoy_fa} ${R1} ${R2} > ${sample_name}_${lane}.sam}
      output {File out ="${sample_name}_${lane}.sam"
    }
  }
  task bwa2 {
      String sample_name
      String threads
      File human_g1k_v37_decoy_fa
      File human_g1k_v37_decoy_fai
      File human_g1k_v37_decoy_bwt
      File human_g1k_v37_decoy_pac
      File human_g1k_v37_decoy_ann
      File human_g1k_v37_decoy_amb
      File human_g1k_v37_decoy_sa
      String lane
      String ID
      String SM
      String PL
      String LB
      String PU
      File R1
      File R2
      command {bwa mem -M -t ${threads} -R '@RG\tID:${ID}\tSM:${SM}\tPL:${PL}\tLB:${LB}\tPU:${PU}' ${human_g1k_v37_decoy_fa} ${R1} ${R2} > ${sample_name}_${lane}.sam}
      output {File out ="${sample_name}_${lane}.sam"
  }
 }

 task mergesamfiles{
      String bin_path
      String sample_name
      File in1
      File in2
      command {java -jar ${bin_path}/picard.jar MergeSamFiles I=${in1} I=${in2} O=${sample_name}_merged.sam}
      output {File out = "${sample_name}_merged.sam"}
 }

The getnames task returns a stdout file with the following format:
/path/to/file/samplename_TTAACGTC_L001_R1_001.fastq.gz
/path/to/file/samplename_TTAACGTC_L001_R2_001.fastq.gz
/path/to/file/samplename_CTCTCTAC_L002_R1_001.fastq.gz
/path/to/file/samplename_CTCTCTAC_L002_R2_001.fastq.gz

How can I use this stdout information to make an array that can be used in a scatter block code? The files are paired end read files for the same sample that have been run in different lanes (L001, L002 ....). I would like to map the two files that belong to each lane together to generate independent sam files. The reference files are the sames for all fastq files, but the RG information should be customized for each single lane where the sample was run.

I have read the documentation https://software.broadinstitute.org/wdl/documentation/plumbing.php#scatter, but I'm not able to figure it out on my own. Specially the pair file matching according to the L001 / L002 / string and assigning different RG values depending on the lane where the sample was run.

I'm using wdltool-0.14 and cromwell 29 locally in Linux CentOS 7 with java openjdk version "1.8.0_141".

Thank you very much,

Best,

Yatros

Best Answers

Answers

  • YatrosYatros Seattle, WA, USAMember

    Hello,

    I have modified my initial wdl code with two awk statements so that my stdout file is tab delimited and lacks the entire path.

    workflow Mapping {
    
      String bin_path
      String wdir
      String sample_name
      String threads
      File human_g1k_v37_decoy_fa
      File human_g1k_v37_decoy_fai
      File human_g1k_v37_decoy_bwt
      File human_g1k_v37_decoy_pac
      File human_g1k_v37_decoy_ann
      File human_g1k_v37_decoy_amb
      File human_g1k_v37_decoy_sa
    
    #   File inputSamplesFile
    #   Array[Array[String]] inputSamples = read_tsv(inputSamplesFile)
    
        call readlink {input: wdir=wdir}
    
        call awk1 {input: in1=readlink.out}
    
        call awk2 {input: in1=awk1.out}
    
        call bwa1 {
        input: sample_name=sample_name, threads=threads, human_g1k_v37_decoy_fa=human_g1k_v37_decoy_fa, human_g1k_v37_decoy_fai=human_g1k_v37_decoy_fai, human_g1k_v37_decoy_bwt=human_g1k_v37_decoy_bwt, human_g1k_v37_decoy_pac=human_g1k_v37_decoy_pac, human_g1k_v37_decoy_ann=human_g1k_v37_decoy_ann, human_g1k_v37_decoy_amb=human_g1k_v37_decoy_amb, human_g1k_v37_decoy_sa=human_g1k_v37_decoy_sa
        }
    
        call bwa2 {
        input: sample_name=sample_name, threads=threads, human_g1k_v37_decoy_fa=human_g1k_v37_decoy_fa, human_g1k_v37_decoy_fai=human_g1k_v37_decoy_fai, human_g1k_v37_decoy_bwt=human_g1k_v37_decoy_bwt, human_g1k_v37_decoy_pac=human_g1k_v37_decoy_pac, human_g1k_v37_decoy_ann=human_g1k_v37_decoy_ann, human_g1k_v37_decoy_amb=human_g1k_v37_decoy_amb, human_g1k_v37_decoy_sa=human_g1k_v37_decoy_sa
        }
    
        call mergesamfiles {
        input: bin_path=bin_path, sample_name=sample_name, in1=bwa1.out, in2=bwa2.out
       }
    
    }
    
        task readlink {
            String wdir
            command {readlink -f ${wdir}/*.fastq.gz}
            output {File out = stdout()}
        }
    
        task awk1 {
            File in1
            command <<<awk -F/ '{print $NF}' ${in1}>>>
            output {File out = stdout()}
        }
    
        task awk2 {
            File in1
            command <<<awk -vRS="\n" -vORS="\t" '1' ${in1}>>>
            output {File out = stdout()}
        }
    
    
      task bwa1 {
          String sample_name
          String threads
          File human_g1k_v37_decoy_fa
          File human_g1k_v37_decoy_fai
          File human_g1k_v37_decoy_bwt
          File human_g1k_v37_decoy_pac
          File human_g1k_v37_decoy_ann
          File human_g1k_v37_decoy_amb
          File human_g1k_v37_decoy_sa
          String lane
          String ID
          String SM
          String PL
          String LB
          String PU
          File R1
          File R2
          command {bwa mem -M -t ${threads} -R '@RG\tID:${ID}\tSM:${SM}\tPL:${PL}\tLB:${LB}\tPU:${PU}' ${human_g1k_v37_decoy_fa} ${R1} ${R2} > ${sample_name}_${lane}.sam}
          output {File out ="${sample_name}_${lane}.sam"
        }
      }
      task bwa2 {
          String sample_name
          String threads
          File human_g1k_v37_decoy_fa
          File human_g1k_v37_decoy_fai
          File human_g1k_v37_decoy_bwt
          File human_g1k_v37_decoy_pac
          File human_g1k_v37_decoy_ann
          File human_g1k_v37_decoy_amb
          File human_g1k_v37_decoy_sa
          String lane
          String ID
          String SM
          String PL
          String LB
          String PU
          File R1
          File R2
          command {bwa mem -M -t ${threads} -R '@RG\tID:${ID}\tSM:${SM}\tPL:${PL}\tLB:${LB}\tPU:${PU}' ${human_g1k_v37_decoy_fa} ${R1} ${R2} > ${sample_name}_${lane}.sam}
          output {File out ="${sample_name}_${lane}.sam"
      }
     }
    
     task mergesamfiles{
          String bin_path
          String sample_name
          File in1
          File in2
          command {java -jar ${bin_path}/picard.jar MergeSamFiles I=${in1} I=${in2} O=${sample_name}_merged.sam}
          output {File out = "${sample_name}_merged.sam"}
     }
    

    The current stdout file looks like this:
    TTAACGTC_L001_R1_001.fastq.gz TTAACGTC_L001_R2_001.fastq.gz CTCTCTAC_L002_R1_001.fastq.gz CTCTCTAC_L002_R2_001.fastq.gz

    However, I'm still trying to convert this file into an array and run each pair of files matching them by lanes in a single scatter code block.

    Any help would be appreciated.

    Best,

    Yatros

  • YatrosYatros Seattle, WA, USAMember

    Dear @ChrisL ,

    With the help of your comments, I have been able to simplify and improve my wdl code. It looks like this now:

    workflow Mapping {
        String wdir
        String sample_name
        String threads
        File human_g1k_v37_decoy_fa
        File human_g1k_v37_decoy_fai
        File human_g1k_v37_decoy_bwt
        File human_g1k_v37_decoy_pac
        File human_g1k_v37_decoy_ann
        File human_g1k_v37_decoy_amb
        File human_g1k_v37_decoy_sa
    
        File file_of_file_names
        Array[File] inputs = read_lines(file_of_file_names)
    
        scatter(i in range(length(inputs) / 2)) {
        Int index0 = i * 2
        Int index1 = i * 2 + 1
        File item0 = inputs[index0]
        File item1 = inputs[index1]
    
        call bwa {input: sample_name=sample_name, threads=threads, human_g1k_v37_decoy_fa=human_g1k_v37_decoy_fa, human_g1k_v37_decoy_fai=human_g1k_v37_decoy_fai, human_g1k_v37_decoy_bwt=human_g1k_v37_decoy_bwt, human_g1k_v37_decoy_pac=human_g1k_v37_decoy_pac, human_g1k_v37_decoy_ann=human_g1k_v37_decoy_ann, human_g1k_v37_decoy_amb=human_g1k_v37_decoy_amb, human_g1k_v37_decoy_sa=human_g1k_v37_decoy_sa, R1 = item0, R2 = item1}
    
        }
    
        call readlink {input: wdir=wdir}
    
        call awk1 {input: in1=readlink.out}
    
        call mergesamfiles {input: sample_name=sample_name, BWA_out = bwa.out}
    
    }
    
        task readlink {
            String wdir
            command {readlink -f ${wdir}/*.fastq.gz}
            output {File out = stdout()}
        }
    
        task awk1 {
            File in1
            command <<<awk -F/ '{print $NF}' ${in1}>>>
            output {File out = stdout()}
        }
    
        task bwa {
            String sample_name
            String threads
            File human_g1k_v37_decoy_fa
            File human_g1k_v37_decoy_fai
            File human_g1k_v37_decoy_bwt
            File human_g1k_v37_decoy_pac
            File human_g1k_v37_decoy_ann
            File human_g1k_v37_decoy_amb
            File human_g1k_v37_decoy_sa
            String lane
            String ID
            String SM
            String PL
            String LB
            String PU
            File R1
            File R2
            command {bwa mem -M -t ${threads} -R '@RG\tID:${ID}\tSM:${SM}\tPL:${PL}\tLB:${LB}\tPU:${PU}' ${human_g1k_v37_decoy_fa} ${R1} ${R2} > ${sample_name}_${lane}.sam}
            output {File out ="${sample_name}_${lane}.sam"}
        }
    
        task mergesamfiles{
            Array[File] BWA_out
            String sample_name
            command {java -jar $PICARD MergeSamFiles I=${sep=' I=' BWA_out} O=${sample_name}_merged.sam}
            output {File out = "${sample_name}_merged.sam"}
        }
    

    However, there are two issues that I cannot fix:

    • The WDL execution model means that it's slightly dangerous to be using directories as inputs, especially as Strings. If you tried to port this into a cloud or docker environment then the workflow won't work as the engine won't know that the String represents something that needs to be copied before running the task.

    1.- I want to get rid of the readlink and awk1 tasks. I just need a command that lists the contents of the original folder where you are running the WDL script, so that I can get a list of the fastq.gz files contained in the original folder. I have tried to look for it, but I haven't found it. This should be an easy fix though.

    2.- I would like to parse an argument/file to the bwa task so that I can assign different values for the @RG parameters ${lane}, ${ID}, ${SM}, ${PL}, ${LB} and ${PU} every time two new fastq files are processed, so that each sam file has its correct @RG header line and the output file is named correctly. Right now all output files have the same @RG information and they have the same output name, because I'm parsing the information via the "Mapping.bwa." json file argument.

    Thank you very much for your help,

    Best,

    Yatros

Sign In or Register to comment.