Which version of Spark works well with the current GATK 4.0?

Hi,

I am interested in using GATK 4.0 with Spark on my PC. From previous posts, I noticed that there are compatibility issues (errors) between GATK4 and the latest versions of Spark (2.1 or 2.2, if I am correct). Which version of Spark should I use to run GATK 4.0?

Furthermore, will the calling results differ between the Spark and non-Spark versions of the GATK tools?

Thank you so much!

Answers

  • shlee (Cambridge): Member, Broadie, Moderator

    Hi @Rossini,

    GATK needs to be run on Unix or Linux. Is your PC set up to run one of these environments?

    I think you may find it helpful to take a minute to read over the developer notes about building your own jars: https://github.com/broadinstitute/gatk#building

    As the notes explain, the jar that is used to run on a Spark cluster (sparkJar), as opposed to the one used locally, does NOT include the Spark and Hadoop libraries. This means that the cluster's own Spark and Hadoop libraries are used instead.

    Which brings me to...

    Our Spark tools are mostly tested on Google Dataproc clusters, which come with their own Spark installations. The Dataproc version list (https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions) shows that the latest Dataproc version, v1.2, comes with Apache Spark 2.2.0. The prior version, v1.1, comes with Apache Spark 2.0.2. It is safe to assume the GATK sparkJar is compatible with these versions of Spark.
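
As a quick sanity check before submitting a job, the Spark version on a cluster can be compared against the two versions named above. The following is a minimal Python sketch and is not part of GATK. The function names and the `DATAPROC_SPARK_VERSIONS` table are hypothetical, the table holds only the two Dataproc image versions mentioned in this answer, and the regular expression assumes the version appears as `version X.Y.Z` in the text being checked (for example, the banner printed by `spark-submit --version`).

```python
import re

# Spark versions bundled with the two Dataproc images named in this thread.
# This is an illustrative table, not an official compatibility list.
DATAPROC_SPARK_VERSIONS = {
    "1.1": "2.0.2",
    "1.2": "2.2.0",
}


def parse_spark_version(banner):
    """Return the first 'version X.Y.Z' string found in the text, or None."""
    match = re.search(r"version\s+(\d+\.\d+\.\d+)", banner)
    return match.group(1) if match else None


def matching_dataproc_image(spark_version):
    """Return the Dataproc image that bundles this Spark version, or None."""
    for image, bundled in DATAPROC_SPARK_VERSIONS.items():
        if bundled == spark_version:
            return image
    return None
```

A result of `None` from `matching_dataproc_image` only means the version is not one of the two mentioned here; it does not mean that version is known to fail.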

    Let's ask @Sheila to answer whether results can be different.

  • Sheila (Broad Institute): Member, Broadie, Moderator

    @Rossini
    Hi,

    The Spark results should be the same as the non-Spark results. The only difference is that the Spark tools run faster on larger data. This thread may help as well.
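
One way to check this on your own data is to compare the variant records from the two runs while ignoring header lines, which can legitimately differ (for example, in the recorded command line). The following is a minimal Python sketch, assuming both outputs are uncompressed VCF text and that the records are expected in the same order; the function names are hypothetical.

```python
def variant_records(vcf_lines):
    """Return the non-empty, non-header lines of a VCF without line endings."""
    return [
        line.rstrip("\n")
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    ]


def same_calls(vcf_lines_a, vcf_lines_b):
    """Return True if both VCFs hold identical records in the same order."""
    return variant_records(vcf_lines_a) == variant_records(vcf_lines_b)
```

Both functions accept any iterable of lines, so two open file handles (say, one for the Spark output and one for the non-Spark output) can be passed to `same_calls` directly.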

    -Sheila
