Possible to create multiple task outputs using variables?

Is it possible to create an output of type "File" with name "VCF_${Chrom}" as shown in the output section below?

task HaplotypeCallerERC {

  String SampleName
  File BamFile
  File BamIndex
  String Chrom

  command {
    ${Paths["java"]} -jar ${Paths["gatk"]} \
        -T HaplotypeCaller \
        -ERC GVCF \
        -R ${Paths["refFasta"]} \
        -I ${BamFile} \
        -L ${Chrom} \
        -o ${SampleName}_${Chrom}_rawLikelihoods.g.vcf
  }

  output {
    File GVCF_${Chrom} = "${SampleName}_${Chrom}_rawLikelihoods.g.vcf"
  }
}

After running HaplotypeCaller, I would like to run GenotypeGVCFs, parallelized by chromosome. However, each GenotypeGVCFs call should take only the relevant input data, i.e. only the files that HaplotypeCaller produced for that particular chromosome.

If this is not possible, would you have an alternative suggestion for how to accomplish this?

One possibility would be to collect all the outputs of HaplotypeCaller with a standard output section, as shown below, and parallelize GenotypeGVCFs by chromosome. By construction, each call would not process data from irrelevant chromosomes. However, my thinking is that passing a lot of unnecessary input data to every call (as many times as there are chromosomes) might add runtime. Please correct me if I'm wrong about the increase in runtime with this approach!

output {
  File GVCF = "${SampleName}_${Chrom}_rawLikelihoods.g.vcf"
}

Thanks a lot,

Alon

Best Answer

  • alongalor
    Accepted Answer

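    I found a solution, which I am currently testing: scatter a sub-workflow over the chromosomes, and inside it scatter HaplotypeCallerERC over the samples, so that each GenotypeGVCFs call only receives the GVCFs for its own chromosome.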

    import "VariantDiscoveryPerChrom.wdl" as sub
    
    workflow VariantDiscovery {
    
      meta {
            author: "Alon Galor"
      }
    
      parameter_meta {
           inputSamplesFile: "a file where SampleNames are tab-separated"
           gvcfSampleName: "the resultant vcf will be named ${gvcfSampleName}_rawVariants.vcf"
      }
    
      File inputSamplesFile
      Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)
      String gvcfSampleName
    
      Map[String, String] paths = {
      "java": "/n/data1/hms/dbmi/park/alon/0_GATK_Tools/jdk1.8.0_45/bin/java",
      "gatk": "/n/data1/hms/dbmi/park/alon/0_GATK_Tools/GenomeAnalysisTK.jar",
      "refFasta": "/home/sl279/BiO/Install/GATK-bundle/2.8/b37/human_g1k_v37_decoy.fasta",
      "dbsnp": "/home/sl279/BiO/Install/GATK-bundle/2.8/b37/dbsnp_138.b37.vcf",
      "picard": "/n/data1/hms/dbmi/park/alon/0_GATK_Tools/picard.jar"}
      Array[String] chromosomes = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
        "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y", "MT"]
    
      Map[String, Array[Int]] runtimeParams = {
      "runtime_Minutes": [50, 150], "cpus": [4, 4], "requested_Memory_Mb_Per_Core": [10000, 10000]}
      Map[String, Array[String]] runtimeParam = {"queue": ["short", "short"]}
    
      scatter (chromosome in chromosomes) {
        call sub.VariantDiscoveryPerChrom {
          input:
            wfInputSamples=inputSamples,
            wfGvcfSampleName=gvcfSampleName,
            wfChromosome=chromosome,
    
            wfPaths=paths,
            wfRuntimeParams=runtimeParams,
            wfRuntimeParam=runtimeParam
        }
      }
    }
    
    # VariantDiscoveryPerChrom.wdl (the sub-workflow imported above, kept in its own file)
    import "HaplotypeCallerERC.wdl" as HC
    import "GenotypeGVCFs.wdl" as GG
    
    workflow VariantDiscoveryPerChrom {
    
      Array[Array[File]] wfInputSamples
      String wfGvcfSampleName
      String wfChromosome
    
      Map[String, String] wfPaths
      Map[String, Array[Int]] wfRuntimeParams
      Map[String, Array[String]] wfRuntimeParam
    
      scatter (wfInputSample in wfInputSamples) {
        call HC.HaplotypeCallerERC {
          input:
            SampleName=wfInputSample[0],
            BamFile=wfInputSample[1],
            BamIndex=wfInputSample[2],
            Chromosome=wfChromosome,
    
            Paths=wfPaths,
            RuntimeParams=wfRuntimeParams,
            RuntimeParam=wfRuntimeParam
        }
      }
    
      call GG.GenotypeGVCFs {
        input:
          GVCFs=HaplotypeCallerERC.GVCF,
          GvcfSampleName=wfGvcfSampleName,
          Chromosome=wfChromosome,
    
          Paths=wfPaths,
          RuntimeParams=wfRuntimeParams,
          RuntimeParam=wfRuntimeParam
      }
    }
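    
    The GenotypeGVCFs task itself is not shown in this thread. As a rough sketch of what it might look like, assuming GATK 3 command-line syntax and the same input names used in the call above: the scattered HaplotypeCallerERC.GVCF arrives as an Array[File], and sep expands it into one -V argument per sample. The output name VCF and the per-chromosome file name are assumptions, not taken from the original pipeline.
    
    task GenotypeGVCFs {
    
      # One GVCF per sample, all for the same chromosome
      Array[File] GVCFs
      String GvcfSampleName
      String Chromosome
    
      Map[String, String] Paths
      # Declared so the call above type-checks; how they feed the
      # runtime section is not shown in the thread and is left out here
      Map[String, Array[Int]] RuntimeParams
      Map[String, Array[String]] RuntimeParam
    
      command {
        ${Paths["java"]} -jar ${Paths["gatk"]} \
            -T GenotypeGVCFs \
            -R ${Paths["refFasta"]} \
            -V ${sep=" -V " GVCFs} \
            -L ${Chromosome} \
            -o ${GvcfSampleName}_${Chromosome}_rawVariants.vcf
      }
    
      output {
        # Constant output name (assumed); only the file name varies
        File VCF = "${GvcfSampleName}_${Chromosome}_rawVariants.vcf"
      }
    }
    
    # To hand the result back to the parent workflow, the sub-workflow
    # could end with a workflow-level output section such as:
    #   output { File VCF = GenotypeGVCFs.VCF }
    # Because the parent calls the sub-workflow inside a scatter, it would
    # then see VariantDiscoveryPerChrom.VCF as an Array[File], one per chromosome.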
    

Answers

  • ChrisL (Cambridge, MA; Member, Broadie, Moderator, Dev)

    Hey @alongalor - when you specify a value name, it does have to be constant (just like in a programming language, where variable names have to be fully known to the compiler before the program starts running).

    It might help to see a small example of a workflow that demonstrates what you're trying to do. I'm sure that there will be a good way to do this!
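
    As a minimal sketch of the usual alternative (not taken from the pipeline in this thread): keep the output name constant and let scatter do the grouping. An output declared as File inside a scattered call is seen as Array[File] outside the scatter. The names CallOneChrom, GatherDemo, sample_name, and perChromGvcfs are hypothetical, and the echo command only stands in for the real HaplotypeCaller command line.

    task CallOneChrom {

      String sample_name
      String chrom

      command {
        # Placeholder for the real tool invocation
        echo "${sample_name} ${chrom}" > ${sample_name}_${chrom}.g.vcf
      }

      output {
        # The output name is fixed; only the file name depends on the inputs
        File GVCF = "${sample_name}_${chrom}.g.vcf"
      }
    }

    workflow GatherDemo {

      Array[String] chromosomes = ["1", "2", "X"]

      scatter (chromosome in chromosomes) {
        call CallOneChrom {
          input:
            sample_name="demo",
            chrom=chromosome
        }
      }

      output {
        # Outside the scatter, CallOneChrom.GVCF is an Array[File] with
        # one element per chromosome, in the order of `chromosomes`
        Array[File] perChromGvcfs = CallOneChrom.GVCF
      }
    }

    So rather than one output name per chromosome, the chromosome is picked by where the call sits: a downstream call placed inside the same scatter sees a single File, and one placed after it sees the whole array.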

  • alongalor (Member)
    edited September 2017

    Hey Chris, thanks for your quick reply.

    What I am trying to do is create a Variant Discovery pipeline (my current, working version is attached below). However, unlike how the pipeline works right now, and as I said in my previous post, I would like to run HaplotypeCaller and then GenotypeGVCFs, both parallelized by chromosome. The caveat is that each time GenotypeGVCFs is called, it should take as input just the files that correspond to the chromosome of that parallelized run.

    More specifically, I'd like to be able to do something like this (see GVCFs=HaplotypeCallerERC.VCF_${chromosome}):

    Note: ignore the nested scatter; I have included it only for brevity. In my actual script I get the same effect by scattering a sub-workflow that itself contains a scattered call.

    workflow jointCallingGenotypes {
    
      File inputSamplesFile
      Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)
      Array[String] chromosomes = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
        "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y", "MT"]
    
      scatter (sample in inputSamples) {
        scatter (chromosome in chromosomes) {
          call HaplotypeCallerERC {
            input:
              wfInputSample=sample,
              Chromosomes=chromosomes
          }
        }
      }
    
      scatter (chromosome in chromosomes) {
        call GenotypeGVCFs {
          input:
            GvcfSampleName=gvcfSampleName,
            # Desired, but not valid WDL: an output name built from a variable
            GVCFs=HaplotypeCallerERC.VCF_${chromosome},
            Chrom=chromosome
        }
      }
    }
    

    Thanks a lot,

    Alon
