
Which version of Spark works well with the current GATK 4.0?


I am interested in using GATK 4.0 with Spark on my PC. From previous posts I noticed that there are compatibility issues (errors) between GATK4 and the latest versions of Spark (2.1 or 2.2, if I recall correctly). Which version of Spark should I use to run GATK 4.0?

Furthermore, will the calling results differ between the Spark and non-Spark versions of GATK?

Thank you so much!



  • shlee (Cambridge) Member, Broadie ✭✭✭✭✭

    Hi @Rossini,

    GATK needs to be run in a Unix or Linux environment. Is your PC set up to run one of these?

    I think you may find it helpful to take a minute to read over the developer notes about building your own jars: https://github.com/broadinstitute/gatk#building.

    As you can see, the jar (sparkJar) used to run on a Spark cluster (but not locally) will NOT include the Spark and Hadoop libraries. This means that the cluster's own Spark and Hadoop libraries are used at runtime instead.
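    A minimal sketch of the build steps described above, assuming you have Java 8 and git-lfs installed (the Gradle target names come from the linked developer notes; check the README for your GATK version, as they may change):

    ```shell
    # Clone the GATK source (large files are pulled via git-lfs).
    git clone https://github.com/broadinstitute/gatk.git
    cd gatk

    # Build the local jar, which bundles Spark so tools can run
    # multi-threaded on a single machine with no Spark install.
    ./gradlew localJar

    # Build the sparkJar for submission to a Spark cluster; this jar
    # deliberately EXCLUDES the Spark and Hadoop libraries so that the
    # cluster's own versions are used at runtime.
    ./gradlew sparkJar
    ```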

    Which brings me to...

    Our Spark tools are mostly tested on Google Dataproc clusters, which come with their own Spark installations. https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions shows that the latest Dataproc version, v1.2, comes with Apache Spark 2.2.0, and the prior version, v1.1, comes with Apache Spark 2.0.2. It is safe to assume the GATK sparkJar is compatible with these versions of Spark.
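    For running on a single PC, as asked above, no separate Spark installation is needed at all: the standard GATK jar bundles its own Spark for local runs. A sketch of a local invocation (the file names are placeholders, and the exact Spark option names can vary between GATK versions, so check `./gatk --help` for yours):

    ```shell
    # Run a Spark-enabled GATK tool on a single machine.
    # Spark arguments go after the "--" separator.
    # input.bam is a placeholder file name.
    ./gatk FlagStatSpark \
        -I input.bam \
        -- \
        --spark-master 'local[4]'   # run locally with 4 worker threads
    ```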

    Let's ask @Sheila to answer whether results can be different.

  • Sheila (Broad Institute) Member, Broadie, Moderator admin


    The Spark results should be the same as the non-Spark results; the only difference is that the Spark tools run faster on larger datasets. This thread may help as well.

