

Detailed documentation of how GATK tools employ Spark

psb21 Member

Good afternoon,

It's been a while since GATK4 came out and the Spark tools were introduced (yeyyy:)), but so far I haven't been able to find a good link to read on how exactly GATK employs Spark.

It would be great if you could fill these pages with some content (single machine multi-core, Spark cluster). In particular, I'm interested in how the jobs are managed: if running locally with, for instance, local[40], how does HaplotypeCaller traverse the data? Does the ActiveRegion traversal still apply for the Spark tools? What about the concept of Walkers? How many blocks of data does each Spark RDD contain? Have you done tests to tune performance, or do you mostly rely on default Spark settings to manage parallelism?

Best regards,
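For context, launching a GATK Spark tool in local mode with an explicit thread count might look like the following sketch (the reference, BAM, and output paths are placeholders; Spark arguments go after the `--` separator, so verify the exact flags against your GATK version):

```shell
# Run HaplotypeCallerSpark on a single machine using 40 local Spark threads.
# Everything after the bare "--" is passed through to Spark itself.
gatk HaplotypeCallerSpark \
    -R reference.fasta \
    -I sample.bam \
    -O sample.vcf.gz \
    -- \
    --spark-master 'local[40]'
```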


  • bhanuGandham (Cambridge, MA) Member, Administrator, Broadie, Moderator

    Hi @psb21

    Here is a Spark document that should help with your questions:

    Let me know if you still have questions after.

  • psb21 Member


    Thanks @bhanuGandham for the link to this new post. It surely helps, but I still find it difficult to understand the core methodological changes from the non-Spark to the Spark tools. What do you mean by "sharding boundary effects"? For instance, in a variant calling pipeline, the genotype likelihoods for a given interval do not depend on those calculated in another region (e.g. a different chromosome), right? In what sense is this a problem for matching the non-Spark HaplotypeCaller results?

    I'm trying to speed up variant calling using Spark. I have access to a Slurm HPC cluster, so I guess it's not that straightforward to run GATK in a proper distributed master-slave architecture (if there is any tutorial on how to set up Slurm jobs to use GATK Spark tools on multiple nodes, I would appreciate it a lot).
    Therefore, I run GATK in local mode with some Spark threads, speeding up the process further by parallelising the number of samples processed simultaneously with GNU parallel. But then I'm having trouble because some samples crash due to Spark errors. Perhaps you could send my logs to the developers? I'm trying to run 8 parallel GATK jobs (8 samples), each using 5 Spark CPUs, on a node with 40 CPUs.
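The setup described above (8 samples at a time, 5 Spark threads each, on a 40-CPU node) could be sketched with GNU parallel along these lines (the sample list and file names are hypothetical placeholders):

```shell
# samples.txt: one sample name per line, e.g. "sampleA".
# -j 8 runs 8 GATK jobs concurrently; each job gets 5 local Spark threads,
# so at most 8 x 5 = 40 threads are active on the node at once.
parallel -j 8 \
    'gatk HaplotypeCallerSpark \
        -R reference.fasta \
        -I {}.bam \
        -O {}.vcf.gz \
        -- --spark-master "local[5]"' \
    :::: samples.txt
```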


    Issue · GitHub
    by bhanuGandham
  • bhanuGandham (Cambridge, MA) Member, Administrator, Broadie, Moderator
    edited February 24

    Hi @psb21

    I am looking into finding detailed answers for you and will get back to you soon.
    In the meantime, please refer to these docs giving more info on Spark:

  • bhanuGandham (Cambridge, MA) Member, Administrator, Broadie, Moderator
    edited February 25

    Hi @psb21

    I spoke with the dev team and this is what they have to say:
    1) Please ensure that you are using the GATK 4.1 version of the HaplotypeCallerSpark tool, as most of these issues have been resolved in that version.
    2) You could use --strict mode to ensure that the Spark and non-Spark versions of the tools are concordant with each other. This will be slower, though.
    3) We would like to remind our users that HaplotypeCallerSpark is in beta and we are constantly working on improving it. It will see many more improvements in the near future.
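Point 2 might look like the following sketch (file paths are placeholders; check `gatk HaplotypeCallerSpark --help` in your GATK 4.1 installation for the exact spelling of the strict-mode flag):

```shell
# Strict mode trades speed for concordance with the non-Spark HaplotypeCaller.
gatk HaplotypeCallerSpark \
    -R reference.fasta \
    -I sample.bam \
    -O sample.vcf.gz \
    --strict \
    -- --spark-master 'local[8]'
```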

    You can follow this GitHub issue for more info:
