How can I reference the outputs of a task in an if block?

gauthier Member, Broadie, Moderator, Dev

I'm trying to write a workflow along the lines of X->Y->Z, where X and Y are in a scatter and Y is conditional on a workflow input. Z expects an Array[File], but since Y is in an if block its output is an Array[File?], and I'm getting a coercion error. What's the right way to do this?

Specifically, my error is "No coercion defined from [Yshard1, Yshard2, Yshard3...] of type 'Array[File?]' to 'Array[File]'."

The relevant part of the WDL looks like this:

# Call variants in parallel over WGS calling intervals
  scatter (index in range(ScatterIntervalList.interval_count)) {
    # Generate GVCF by interval
    call HaplotypeCaller {
      input:
        contamination = CheckContamination.contamination,
        input_bam = GatherBamFiles.output_bam,
        interval_list = ScatterIntervalList.out[index],
        gvcf_basename = base_file_name,
        genotype_and_filter = genotype_and_filter,
        ref_dict = ref_dict,
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        # Divide the total output GVCF size and the input bam size to account for the smaller scattered input and output.
        disk_size = ((binned_qual_bam_size + GVCF_disk_size) / hc_divisor) + ref_size + additional_disk,
        preemptible_tries = agg_preemptible_tries
     }
    if (do_filtering) {
      call FilterVcf {
        input:
          input_vcf = HaplotypeCaller.output_gvcf,
          input_vcf_index = HaplotypeCaller.output_gvcf_index,
          gvcf_basename = base_file_name,
          interval_list = ScatterIntervalList.out[index],
          gvcf_basename = base_file_name,
          # The output here should be the same size
          disk_size = ((binned_qual_bam_size + GVCF_disk_size) / hc_divisor) + ref_size + additional_disk,
          preemptible_tries = preemptible_tries
      }
    }
  }

  Array[File] merge_input = select_first([FilterVcf.output_vcf, HaplotypeCaller.output_gvcf])
  Array[File] merge_input_index = select_first([FilterVcf.output_vcf_index, HaplotypeCaller.output_gvcf_index])
  String name_token = if do_filtering then ".filtered" else ".g"

  # Combine by-interval GVCFs into a single sample GVCF file
  call MergeVCFs {
    input:
      input_vcfs = merge_input,
      input_vcfs_indexes = merge_input_index,
      output_vcf_name = final_gvcf_base_name + name_token + ".vcf.gz",
      disk_size = GVCF_disk_size,
      preemptible_tries = agg_preemptible_tries
  }

Answers

  • ChrisL Cambridge, MA Member, Broadie, Moderator, Dev
    edited January 25

    The way to go from Array[X?] to Array[X] is to use the select_all() function, which keeps only the values in the array of optionals that are set.

    In your example, I'd guess this would go in somewhere like:

    Array[File] merge_input = select_first([select_all(FilterVcf.output_vcf), HaplotypeCaller.output_gvcf])
    Array[File] merge_input_index = select_first([select_all(FilterVcf.output_vcf_index), HaplotypeCaller.output_gvcf_index])
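
    As a standalone illustration of what select_all() does, here is a minimal sketch, assuming WDL version 1.0. The workflow name and the variables xs, even, and evens are hypothetical and not part of the pipeline above. A declaration inside an if block inside a scatter is seen from outside as an array of optionals, and select_all() keeps only the entries that were set:

    ```wdl
    version 1.0

    workflow select_all_demo {
      # Hypothetical input values, for illustration only
      Array[Int] xs = [1, 2, 3, 4]

      scatter (x in xs) {
        # Only some shards enter the if block
        if (x % 2 == 0) {
          Int even = x
        }
      }

      # Outside the scatter, `even` has type Array[Int?];
      # select_all() drops the unset entries and yields an Array[Int].
      Array[Int] evens = select_all(even)

      output {
        Array[Int] out = evens
      }
    }
    ```

    Note that select_all() can return fewer elements than there are scatter shards, so its result no longer lines up index-for-index with other outputs gathered from the same scatter.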
    

    EDIT: I think I misread your WDL.

    It looks like you want something like this (I simplified the names, hopefully it's clear what they map back to):

    scatter {
      call X # produces File X.f
      if {
        call Y # produces File Y.f
      }
      File f = select_first([Y.f, X.f])
    }
    # Gather the Files in the normal way:
    Array[File] merge_input = f
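
    The outline above can be fleshed out into a complete workflow. This is only a sketch under stated assumptions: it targets WDL version 1.0, and the tasks X and Y, their commands, and the inputs run_y and n are hypothetical stand-ins for HaplotypeCaller, FilterVcf, do_filtering, and the interval count. The key point is that select_first() is applied per shard, inside the scatter, where Y.f is a single File? rather than an Array[File?]:

    ```wdl
    version 1.0

    # Hypothetical stand-in for the task that always runs
    task X {
      input {
        Int i
      }
      command <<<
        echo "shard ~{i}" > x.txt
      >>>
      output {
        File f = "x.txt"
      }
    }

    # Hypothetical stand-in for the optional post-processing task
    task Y {
      input {
        File in_file
      }
      command <<<
        tr 'a-z' 'A-Z' < ~{in_file} > y.txt
      >>>
      output {
        File f = "y.txt"
      }
    }

    workflow conditional_in_scatter {
      input {
        Boolean run_y = true
        Int n = 3
      }

      scatter (i in range(n)) {
        call X { input: i = i }

        if (run_y) {
          call Y { input: in_file = X.f }
        }

        # Inside the scatter, Y.f is a File? and X.f is a File.
        # select_first() returns the first defined value, as a plain File.
        File chosen = select_first([Y.f, X.f])
      }

      # Outside the scatter, `chosen` is gathered into an Array[File],
      # one element per shard, so no optional types remain.
      Array[File] merge_input = chosen

      output {
        Array[File] out = merge_input
      }
    }
    ```

    Because the choice is made shard by shard, the gathered array always has one element per shard, whichever branch was taken.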
    
    
  • Ruchi Member, Broadie, Moderator, Dev

    @gauthier I believe one way to get what you need is to try this:

    # Call variants in parallel over WGS calling intervals
      scatter (index in range(ScatterIntervalList.interval_count)) {
        # Generate GVCF by interval
        call HaplotypeCaller {
          input:
            contamination = CheckContamination.contamination,
            input_bam = GatherBamFiles.output_bam,
            interval_list = ScatterIntervalList.out[index],
            gvcf_basename = base_file_name,
            genotype_and_filter = genotype_and_filter,
            ref_dict = ref_dict,
            ref_fasta = ref_fasta,
            ref_fasta_index = ref_fasta_index,
            # Divide the total output GVCF size and the input bam size to account for the smaller scattered input and output.
            disk_size = ((binned_qual_bam_size + GVCF_disk_size) / hc_divisor) + ref_size + additional_disk,
            preemptible_tries = agg_preemptible_tries
         }
        if (do_filtering) {
          call FilterVcf {
            input:
              input_vcf = HaplotypeCaller.output_gvcf,
              input_vcf_index = HaplotypeCaller.output_gvcf_index,
              gvcf_basename = base_file_name,
              interval_list = ScatterIntervalList.out[index],
              # The output here should be the same size
              disk_size = ((binned_qual_bam_size + GVCF_disk_size) / hc_divisor) + ref_size + additional_disk,
              preemptible_tries = preemptible_tries
          }
        }
    
        File final_vcf = select_first([FilterVcf.output_vcf, HaplotypeCaller.output_gvcf])
        File final_vcf_index = select_first([FilterVcf.output_vcf_index, HaplotypeCaller.output_gvcf_index])
      }
    
      Array[File] merge_input = final_vcf
      Array[File] merge_input_index = final_vcf_index
      String name_token = if do_filtering then ".filtered" else ".g"
    
      # Combine by-interval GVCFs into a single sample GVCF file
      call MergeVCFs {
        input:
          input_vcfs = merge_input,
          input_vcfs_indexes = merge_input_index,
          output_vcf_name = final_gvcf_base_name + name_token + ".vcf.gz",
          disk_size = GVCF_disk_size,
          preemptible_tries = agg_preemptible_tries
      }
    
    