[4] (howto) Use scatter-gather to joint call genotypes

KateNKateN Cambridge, MAMember, Broadie, Moderator admin
edited January 2017 in Tutorials

Requirements

This tutorial assumes that you have a (very) basic understanding of GATK tools and have read the Getting Started guide. You should also have installed the necessary tools and gone through the previous tutorial, as we will build off concepts learned there. Lastly, you will need to download the zip bundle containing this tutorial's data. We will use a toy dataset: NA12878, NA12877, and NA12882, subset to chromosome 20. For a detailed description of each file in the bundle, see the README.


Introduction

Our previous tutorials have given you a solid base in WDL, but now we’re going to test your skills with a particularly complex workflow structure: a scatter operation. To illustrate this feature, we’ve chosen to pull the joint calling variant discovery section of the GATK Best Practices pipeline. HaplotypeCaller takes bams and outputs genotype likelihoods for every possible variant site. Then GenotypeGVCFs uses genotype likelihoods from multiple samples to call variants across a cohort. This method is most often used with large cohorts, but we will be executing it on only 3 samples simply to give you an idea of how it works.

image

As you can see from this diagram, there are a lot of pieces in this workflow. In previous tutorials, the workflows were simple enough that the diagrams may have seemed unnecessary, but once you get to more complex examples like this one, the utility of the diagram is a lot more obvious.

We start from inputSamples, an array that contains samples. Each sample is an array with three Files*. The first position (array index 0) contains the name of our sample, followed by the bam with our data (array index 1), and the index file for that bam (array index 2). In this example, we will scatter by sample, then gather after the first step, HaplotypeCallerERC. The gathered GVCFs are then processed by GenotypeGVCFs to produce a multisample VCF of raw variant calls that are ready for filtering in the next step of the pipeline.

*Files can also simply be strings, with no need to alter type, as is the case for sampleName.


Write your script

Before we get to writing the workflow itself, we need to generate a file that will allow us to easily input our complex Array[Array[File]] inputSamples. Open up a new text file, which we’ve chosen to call inputsTSV.txt, and list each sample’s inputs you need on a new line, in the following order:

sampleName  inputBam  bamIndex
sampleName2 inputBam2 bamIndex2
...etc.

Each input is separated by a tab character, and each sample is on its own line. This file saves you from having to deal with nested array declarations in your inputs json later.

Now, open up a new file in a text editor of your choice and call it jointCallingGenotypes.wdl. All of our script writing will be done in this file.


Workflow

Starting with our workflow inputs, we will need the inputSamplesFile we created above. To convert that file into an Array[Array[File]], we can use a function from the WDL standard library, called read_tsv.

  Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)

Now, to scatter by sample, we wrap our call to the task HaplotypeCallerERC in a scatter operation, like so:

  scatter (sample in inputSamples) {
    call HaplotypeCallerERC {...}
  }

The scatter operation has an implied gather step at the end, meaning you don’t have to do anything but call your next task outside the operation. The individual GVCFs produced by each parallel call to HaplotypeCallerERC is grouped into an array of GVCFs that you can pass directly to the next task.

  scatter (sample in inputSamples) {
    call HaplotypeCallerERC {...}
  }
  call GenotypeGVCFs { input: GVCFs=HaplotypeCallerERC.GVCF }

Now let’s look at it all together, with the inputs specified in the task call. As we have done in previous tutorials, we have added in the pass-through variables (gatk, refFasta, etc.) as well.

workflow jointCallingGenotypes {

  File inputSamplesFile
  Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)
  File gatk
  File refFasta
  File refIndex
  File refDict

  scatter (sample in inputSamples) {
    call HaplotypeCallerERC {
      input: GATK=gatk, 
        RefFasta=refFasta, 
        RefIndex=refIndex, 
        RefDict=refDict, 
        sampleName=sample[0],
        bamFile=sample[1], 
        bamIndex=sample[2]
    }
  }
  call GenotypeGVCFs {
    input: GATK=gatk, 
      RefFasta=refFasta, 
      RefIndex=refIndex, 
      RefDict=refDict, 
      sampleName="CEUtrio", 
      GVCFs=HaplotypeCallerERC.GVCF
  }
}

Now, armed with a completed workflow script, we venture on to task declarations!

Task

HaplotypeCallerERC

At this point in our tutorial series, we have written quite a few task declarations. As always, we begin by declaring inputs. We need to specify all pass-through variable names, as well as task-specific inputs, following the diagram. Since we have handled parsing the complicated Array[Array[File]] at the workflow level, our task-specific inputs here are simply String sampleName, File bamFile, and File bamIndex.

task HaplotypeCallerERC {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  File bamFile
  File bamIndex
}

In previous tutorials we have used the HaplotypeCaller tool in normal mode. This tutorial applies the joint calling method, so we will be running the tool in reference confidence mode. The command is the same as it was in previous tutorials, except with an added -ERC GVCF to enable reference confidence mode.

    java -jar ${GATK} \
        -T HaplotypeCaller \
        -ERC GVCF \
        -R ${RefFasta} \
        -I ${bamFile} \
        -o ${sampleName}_rawLikelihoods.g.vcf

Lastly, we need to assign our tool’s output to an output variable in WDL. This is especially important if you are not running this script on your local machine--often, to save space, platforms will delete any files generated that are not assigned an output variable when the entire workflow finishes.

  output {
    File GVCF = "${sampleName}_rawLikelihoods.g.vcf"
  }

Add the output section in your task, below the command, and you have a completed WDL task. If you’ve been following along, your script should look like this:

task HaplotypeCallerERC {

  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  File bamFile
  File bamIndex

  command {
    java -jar ${GATK} \
        -T HaplotypeCaller \
        -ERC GVCF \
        -R ${RefFasta} \
        -I ${bamFile} \
        -o ${sampleName}_rawLikelihoods.g.vcf
  }
  output {
    File GVCF = "${sampleName}_rawLikelihoods.g.vcf"
  }
}

GenotypeGVCFs

This task takes in the compiled array of GVCFs from HaplotypeCallerERC, so in addition to our usual pass-through variables, we will take in Array[File] GVCFs. As with previous tasks, we also give this one a String sampleName to assist in naming the output file.

task GenotypeGVCFs {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  Array[File] GVCFs
}

When calling GenotypeGVCFs, you must specify each GVCF input with a separate -V. We have an array of the GVCF files, and so we can use a delimiter in order to insert a -V between each item in the array. You’ve seen before how to reference a variable in WDL, ${var}, and you can specify, the delimiter itself using sep within the variable reference, like so: ${sep=” -V “ var}. With this in mind, our command will follow the below format:

    java -jar ${GATK} \
        -T GenotypeGVCFs \
        -R ${RefFasta} \
        -V ${sep=" -V " GVCFs} \
        -o ${sampleName}_rawVariants.vcf

Note that we typed a preceding -V in the above command-- this is to add a -V before the first item in the array, which the sep delimiter will not do. To wrap this task up, assign your output variable, and put everything together.

task GenotypeGVCFs {

  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  Array[File] GVCFs

  command {
    java -jar ${GATK} \
        -T GenotypeGVCFs \
        -R ${RefFasta} \
        -V ${sep=" -V " GVCFs} \
        -o ${sampleName}_rawVariants.vcf
  }
  output {
    File rawVCF = "${sampleName}_rawVariants.vcf"
  }
}

Running the pipeline

To run your pipeline: validate it, then generate and fill in an inputs file, then run your script locally. We have done so using the following commands:

#validate:
java -jar wdltool.jar validate jointCallingGenotypes.wdl

#generate inputs:
java -jar wdltool.jar inputs jointCallingGenotypes.wdl > jointCallingGenotypes_inputs.json

#run locally:
java -jar cromwell.jar run jointCallingGenotypes.wdl jointCallingGenotypes_inputs.json

While running, Cromwell will show update messages on your terminal until the workflow is complete.


Check your results

The terminal will print out the location of your final output at the very end. Let’s take a look at our results in IGV. Open the final VCF file, and zoom in on a portion of chromosome 20 (20:10,000,000-10,200,000)*. You will see what is pictured below. The top track, CEUtrio_rawVariants.vcf, is the joint called variants across the whole cohort. Below that, each individual sample is be listed, along with its variants.

*Note: Our original files were subset to this particular region, which is why the variants are isolated here.

image

If you zoom in further, you can take a look at individual variants. Below we can see two variant sites that were called in the SNAP25-AS1 gene. At the first site, one sample (NA12877) is hom ref, whereas the other two are het. At the second site, two are het, and NA12822 is hom var.

image

In this tutorial, you have learned one of the more complex techniques in WDL: scattering. Scattering can be done over files, as demonstrated here, or over any other primitive types in WDL (String, Int, etc.). Stay tuned for additional WDL tutorials.

Post edited by KateN on

Comments

  • aleksejs_sazonovsaleksejs_sazonovs Hinxton, UKMember

    generate inputs:

    java -jar cromwell.jar inputs jointCallingGenotypes.wdl > jointCallingGenotypes_inputs.json

    Seems that wdltool should be used to generate the input template (wdltool 0.4, cromwell 0.21)

  • KateNKateN Cambridge, MAMember, Broadie, Moderator admin

    Thank you @aleksejs_sazonovs! That is a relic from when Cromwell and wdltool used to be one single jar file. I've updated the documentation to reflect the current state of functionality.

  • alphahmedalphahmed JAPANMember

    Thanks again for the great tutorial.

    I'd like to know about the best way to continue the scattered function through a pipe-like line of tasks, before being scattered.

    The fact that "scatter" can generate an input array from a text file is very convenient; however, I'd like to continue calling tasks on the output of each process before having the auto-gathering at the end of the first call. Can I just keep calling the following tasks under the scatter (indented) function? Can I just specify the input file using the array's sample number? How can I keep it in parallel though?
    If I use the ${sep=" -V " ...} script, it implies that the array is a multiple files' input for a task that accepts multiple inputs, rather than separate processes.

    I'd really appreciate a kind clarification; meanwhile, I'll keep trying implementing different ways.

    Ahmed

  • alphahmedalphahmed JAPANMember
    Any guidance on how to call multiple tasks on the output of scatter (output array) separately, before gathering?
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Sure, you can just put all those tasks inside the scatter. The tasks will be applied per unit of the scatter, forming linear chains that proceed in parallel until you choose to gather the outputs.
  • sryan6sryan6 University of Notre DameMember
    edited October 2017

    For the input array file with sample names and bam files (i.e., inputsTSV.txt from the example above) , should the names be the same if multiple bam files belong to the same sample (same SM ID in the @RG), or should these names be unique? (how is it using the sampleName, for the loop or for HaplotypeCaller?)

    Now that I think about it, all bam files with the same SM ID must all be merged given they are running on separate chains, correct? So then the best approach would be to merge the bam files or run scatter by interval/chromosomes.

  • theomarkertheomarker qingdaoMember

    Dear GATK team,
    I tried scatter-gather (haplotypecaller then genotypeGVCFs) on 100 exome BAMs on a node (only 40 cores) on the cluster. But the node seems unaccessible now. Is that because the memory issue? Or do I need to prepare a tmpdir (-Djava.io.tmpdir=/tmp) for haplotypecaller and genotypeGVCFs?
    although I found there seem to have a tmpdir for each exome

    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/work/home/jiahuan/Heart_WES/HTX/cromwell-executions/jointCallingGenotypes/d71d6520-d369-43c8-bfce-f0521358f8e9/call-HaplotypeCallerERC/shard-0/execution/tmp.2jWDLA
    

    here is part of the log:

    2018-03-28 21:20:13,945 INFO  - BackgroundConfigAsyncJobExecutionActor [UUID(d71d6520)jointCallingGenotypes.HaplotypeCallerERC:84:1]: job id: 4937
    2018-03-28 21:20:13,955 INFO  - BackgroundConfigAsyncJobExecutionActor [UUID(d71d6520)jointCallingGenotypes.HaplotypeCallerERC:84:1]: Status change from - to WaitingForReturnCodeFile
    

    Thanks!

  • hkewardhkeward Member

    In GATK 4, the GenotypeGVCFs tool only takes a single sample as input so you need to call an additional tool -- CombineGVCFs -- and send the output, which is now a single file, to GenotypeGVCFs.

  • tytolintytolin Member

    @hkeward said:
    In GATK 4, the GenotypeGVCFs tool only takes a single sample as input so you need to call an additional tool -- CombineGVCFs -- and send the output, which is now a single file, to GenotypeGVCFs.

    So did you mean that the CombineGVCFs can generate a vcf files containing different individuals or different samples?

    For example: I want to gather the WGS data of different dogs, so I can call vcf files wit HaplotypeCaller for individual. And then, I gather the different dogs' VCF into one VCF that contain different dogs and we can recognize each dog's genotype from the VCF file?

  • Dear GATK team,
    Firstly, great tuitional, really helped a lot, Thanks!

    I tried to run this with gatk4 and just use gatk3 to run the CombineGVCFs part, succeed.
    But failed when I tried to run it totally with gatk4.(via call CombineGVCFs to merge 1 GVCF file as input for GenotypeGVCFs) and I got this fault report:
    A USER ERROR has occurred: An index is required but was not found for file drivingVariantFile:/Users/angry_panda/Documents/cromwell/jointCallingGenotypes2/cromwell-executions/jointCallingGenotypes/cbf1d4f5-9dcb-4acd-b3b8-4d45f5bd6fcd/call-GenotypeGVCFs/inputs/976973035/GVCF_cohort.g.vcf.gz. Support for unindexed block-compressed files has been temporarily disabled. Try running IndexFeatureFile on the input.
    I didn't understand it. Could you help me to fix it?

  • blueskypyblueskypy Member ✭✭
    edited December 2018

    what happens if one item of the scatter fails, does the gather stop or proceed w/o the failed item?

    where can such action of gather be defined by users? After the error in the item is fixed, how to continue running the workflow from where it stops?

Sign In or Register to comment.