Scatter Gather on HaplotyperCaller - Scatter on locus or intervals ?

Hello,
I am now learning to use WDL esp for running Scatter Gather on HaplotyperCaller
1) In this example, https://software.broadinstitute.org/wdl/userguide/article?id=7614 , scatter gather is done on default setting of Haplotyer Caller (i.e. by locus)
2) On the other hand, in this example, https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160720.wdl, scatter gather was done on a specific interval.

Question 1) Is there an advantage of running scatter-gather over an interval instead of "by locus" ?
Question 2) Suppose I wanted to run scatter gather on HaplotypeCaller (to generate raw VCF file) and I want it to be scattered per chromosome, could you give me some advice on how I can do that ?

Best Answer

Answers

  • KateNKateN Cambridge, MAMember, Broadie, Moderator admin

    Scattering with WDL can be done over anything so long as you have an Array that contains the items you wish to scatter over. In the case of the first example you link, I scattered by sample, as it was the simplest way to demonstrate scattering for a tutorial.

    The second example you link scatters by interval, which is a faster way to run, as it scatters on a smaller section than the earlier by-sample example. If you wished to scatter per chromosome, I would recommend following the by-interval method in example 2, where the intervals you would give would be entire chromosomes.

    I hope this answers your question. If not, feel free to ask more questions.

  • kritikoolkritikool Member
    edited September 2016

    Thanks for the explanation.

    1) In the "by-interval method in example 2" , why did they "Use python to create the Sequencing Groupings used for BQSR and PrintReads Scatter. "

    2) I wrote some code for scattering and gathering Haplotyper Caller VCF file generation per chromosome. If you could check the correctness of it, that would be great.

    3) Does WDL have a way to create for loops ? I would like to loop the code below (ie. scattering and gathering Haplotyper Caller VCF ) for every bam file in my list.

    workflow jointCallingGenotypes {
    
      String sample_name
      String final_vcf_name
      Array[String] intervals = ["1"], ["2"], ["3"], ["4"], ["5"], ["6"], ["7"], 
    ["8"],["9"],["10"],["11"],["12"],["13"],["14"],["15"],["16"],["17"],["18"],["19"],["20"],["21"],["22"],["X"],["Y"]
      File refFasta
      File refIndex
      File refDict
      File inputBam
      File inputIndex
    
      # Call variants in parallel over WGS calling intervals
      scatter (subInterval in intervals) {
    
        # Generate VCF by interval
        call HaplotypeCaller {
          input:
            bamFile = inputBam,
            bamIndex = inputIndex,
            interval_list = subInterval,
            RefDict = refDict,
            RefFasta = refFasta,
            RefIndex = refIndex,
            sampleName = sample_name
         }
    
       }
    
      # Combine by-interval VCFs into a single sample VCF file
      call GatherVCFs {
        input:
          input_vcfs = HaplotypeCaller.VCF,
          output_vcf_name = final_vcf_name
       }
    
      task HaplotypeCaller {
    
    
      File RefFasta
      File RefIndex
      File RefDict
      String sampleName
      File bamFile
      File bamIndex
      File interval_list
    
      command {
        java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m  \
            -jar /path/GATK.jar \
            -T HaplotypeCaller \
            -R ${RefFasta} \
            -I ${bamFile} \
            -L ${interval_list}
            -o ${sampleName}.vcf
      }
      output {
        File VCF = "${sampleName}.vcf"
      }
     }
    
     ## Combine multiple VCFs from scattered HaplotypeCaller runs
    task GatherVCFs {
      Array[File] input_vcfs
      String output_vcf_name
    
      # using MergeVcfs instead of GatherVcfs so we can create indices
      # WARNING 2015-10-28 15:01:48 GatherVcfs  Index creation not currently supported when gathering block compressed VCFs.
      command {
        java -Xmx2g -jar /usr/gitc/picard.jar \
        MergeVcfs \
        INPUT=${sep=' INPUT=' input_vcfs} \
        OUTPUT=${output_vcf_name}
      }
      output {
        File output_vcf = "${output_vcf_name}"
       }
    
    Post edited by KateN on
  • Nested scatters would have been the optimal solution. Given they are not available, "you can put the inner scatter into a sub-workflow and Cromwell will be able to run it just fine," as referenced here: https://github.com/broadinstitute/cromwell/issues/838

    I have implemented this solution. Although cumbersome, it works.

  • EADGEADG KielMember ✭✭✭

    Instead of using python or hardcoding all the loci into your wdl file, you can use built-in wdl functions to create a wdl-arry from a file wich you can you provide with your input.json, so you are more flexible for the future.

    task CreateTargetedIntervals {
       File targeted_intervals
    
       command
       <<<
       echo "Just writing Intervals to an array for wdl"
       >>>
       runtime {
      [fill in your runtime args]
       }
       output {
        Array[String] intervals = read_lines("${targeted_intervals}")
       }
    
    }
    
    

    The input file has to look like:

    chr1:12918805-190424098
    chr2:25457042-209116389
    chr3:37034954-187451505
    ...
    

    or

    1
    2
    3
    ...
    

    in your case.

  • henderj8henderj8 Cincinnati Children's Hospital Member

    All,

    I am trying to understand how to scatter over a Picard generated interval list for exome analysis and was hoping to get some guidance. When I try to input the interval list into my WDL script, I get the following error when running:

    Could not coerce JsString value for 'nested.gatk4.intervals' ("/data/seq_tuning/resource/exome_calling_regions.v1.interval_list") into: WdlMaybeEmptyArrayType(WdlStringType)

    I am passing it to WDL in the workflow section as an "Array[String] intervals", then "scatter (subInterval in intervals)" Here is an example of my script:

    workflow gatk4 {
        String reference
        File FASTQ
        File FASTQ2
        String sample_id
        File GATK
        File dbSNP_vcf
        File dbSNP_vcf_index
        File known_indels_sites_VCFs
        File known_indels_sites_indices
        Array[String] intervals
        File ref_dict
        File ref_fasta
        File ref_fasta_index
        File ref_amb
        File ref_ann
        File ref_bwt
        File ref_pac
        File ref_sa
        String module_java_version
    
        call bwamem {
            input: ... }
    
        call SortSam {
            input:  ... }
    
        call MarkDuplicates {
            input: ... }
    
    scatter (subInterval in intervals) {
        call BaseRecalibrator {
        input:
            input_bam = MarkDuplicates.dedub_bam,
            recalibration_report_filename = sample_id + ".recal_data.csv",
            dbSNP_vcf = dbSNP_vcf,
            dbSNP_vcf_index = dbSNP_vcf_index,
            known_indels_sites_VCFs = known_indels_sites_VCFs,
            known_indels_sites_indices = known_indels_sites_indices,
            ref_dict = ref_dict,
            ref_fasta = ref_fasta,
            ref_fasta_index = ref_fasta_index,
            GATK = GATK,
            subInterval = subInterval,
            module_java_version = module_java_version
        }
    

    Here is the BaseRecalibrator task and command section:

    task BaseRecalibrator {
        File input_bam
        File GATK
        String recalibration_report_filename
        File dbSNP_vcf
        File dbSNP_vcf_index
        File known_indels_sites_VCFs
        File known_indels_sites_indices
        File ref_dict
        File ref_fasta
        File ref_fasta_index
        File subInterval
        String module_java_version
    
    command {
        module load ${module_java_version}
        ${GATK} --javaOptions "-Xmx4g" \
        BaseRecalibrator \
        -R=${ref_fasta} \
        -I=${input_bam} \
        -O=${recalibration_report_filename} \
        -knownSites ${dbSNP_vcf} \
        -knownSites ${known_indels_sites_VCFs} \
        -L ${sep=" -L " subInterval}
    }
    
  • ChrisLChrisL Cambridge, MAMember, Broadie, Moderator, Dev ✭✭

    @henderj8 I have an answer for you, but could you add this new question as a separate forum post? Of course, feel free to point back to this post if you think it adds useful context but it'll make life easier for people coming after you who are trying to search for the same answer! Thanks! :)

  • henderj8henderj8 Cincinnati Children's Hospital Member

    Thanks @ChrisL! I reposted the question here.

Sign In or Register to comment.