To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Scatter-Gather over Picard Generated Interval List?

henderj8henderj8 Cincinnati Children's Hospital Member

Hey WDL Team,

I am trying to understand how to scatter over a Picard generated interval list for exome analysis and was hoping to get some guidance. When I try to input the interval list into my WDL script, I get the following error when running:

Could not coerce JsString value for 'nested.gatk4.intervals' ("/data/seq_tuning/resource/exome_calling_regions.v1.interval_list") into: WdlMaybeEmptyArrayType(WdlStringType)

I am passing it to WDL in the workflow section as an "Array[String] intervals", then "scatter (subInterval in intervals)" Here is an example of my script:

workflow gatk4 {
    String reference
    File FASTQ
    File FASTQ2
    String sample_id
    File GATK
    File dbSNP_vcf
    File dbSNP_vcf_index
    File known_indels_sites_VCFs
    File known_indels_sites_indices
    Array[String] intervals
    File ref_dict
    File ref_fasta
    File ref_fasta_index
    File ref_amb
    File ref_ann
    File ref_bwt
    File ref_pac
    File ref_sa
    String module_java_version

    call bwamem {
    input: ... }

    call SortSam {
    input: ... }

    call MarkDuplicates {
        input: ... }

scatter (subInterval in intervals) {
    call BaseRecalibrator {
    input:
        input_bam = MarkDuplicates.dedub_bam,
        recalibration_report_filename = sample_id + ".recal_data.csv",
        dbSNP_vcf = dbSNP_vcf,
        dbSNP_vcf_index = dbSNP_vcf_index,
        known_indels_sites_VCFs = known_indels_sites_VCFs,
        known_indels_sites_indices = known_indels_sites_indices,
        ref_dict = ref_dict,
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        GATK = GATK,
        subInterval = subInterval,
        module_java_version = module_java_version
    }

Here is the BaseRecalibrator task and command section:

task BaseRecalibrator {
    File input_bam
    File GATK
    String recalibration_report_filename
    File dbSNP_vcf
    File dbSNP_vcf_index
    File known_indels_sites_VCFs
    File known_indels_sites_indices
    File ref_dict
    File ref_fasta
    File ref_fasta_index
    File subInterval
    String module_java_version

command {
    module load ${module_java_version}
    ${GATK} --javaOptions "-Xmx4g" \
    BaseRecalibrator \
    -R=${ref_fasta} \
    -I=${input_bam} \
    -O=${recalibration_report_filename} \
    -knownSites ${dbSNP_vcf} \
    -knownSites ${known_indels_sites_VCFs} \
    -L ${sep=" -L " subInterval}
}

This references a prior comment asked here.

Answers

  • ChrisLChrisL Cambridge, MAMember, Broadie, Dev

    Hey @henderj8 -

    The problem here is that your WDL wants an Array[String] for intervals but you're providing it with a single String value ("/data/seq_tuning/resource/exome_calling_regions.v1.interval_list").

    Based on the filename, I'm going to guess that this is a File containing the list, and based on your WDL, I'm going to guess that each line in the file should be read as a String. Ok, so to make this work should be as simple as replacing the line Array[String] intervals with these two lines:

    File intervals_file
    Array[String] intervals = read_lines(intervals_file)
    

    And updating your inputs.json to supply the file path for intervals_file instead of intervals directly.

    Hope that helps!

  • henderj8henderj8 Cincinnati Children's Hospital Member

    Thank you @ChrisL!

    That solved my problem but I ran into a similar problem that was discussed here, where my interval file is too big for Cromwell to read.. Instead of using that picard interval list, is there a way I could just scatter by chromosome for the time being? More for the purpose of testing the rest of the pipeline.

    Say I generate a .txt file listing all chromosomes (1,2,3.. etc.), if I pass the WDL script that file in place of where the picard list was would that do the trick or is it more complicated than that?

    Thanks again for your help!

  • ChrisLChrisL Cambridge, MAMember, Broadie, Dev

    @henderj8 how big is the interval list? I'm sort-of surprised that an interval list is hitting the file size limit (if it's not an insane size and you're the only one using your Cromwell instance, you could potentially raise that limit for yourself)

    What would be in the chromosome list? If it's just a sequence of integers you can try using the range(n) function to generate an equivalent Array[Int] without having to worry about reading them from a file.

  • henderj8henderj8 Cincinnati Children's Hospital Member
    edited October 2017

    I actually ended up using the SplitIntervals tool to divide my huge picard interval file and then I passed my WDL script that output array of interval files. Worked like a charm! Thanks @ChrisL!

Sign In or Register to comment.