Update: July 26, 2019
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on Github.
This section of the forum is now closed; we are working on a new support model for WDL that we will share here shortly. For Cromwell-specific issues, see the Cromwell docs and post questions on Github.
GATK4 Spark with WDL

With GATK4, Spark options can be specified on the gatk-launch command line, e.g.
./gatk-launch PrintReadsSpark \
    -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
    -- \
    --sparkRunner SPARK --sparkMaster <master_url> \
    --num-executors 5 --executor-cores 2 --executor-memory 4g \
    --conf spark.yarn.executor.memoryOverhead=600
However, some of the same options can also be specified in the runtime block of a WDL task:
task sparkjob_with_yarn_cluster {
  ....
  runtime {
    appMainClass: "${entry_point}"
    executorMemory: "4G"
    executorCores: "2"
  }
}
So should those options be specified in both places or just one of them?
Answers
Admittedly I don't know a huge amount about gatk-launch or which backend you're using to run these jobs, but I would guess you want to use the former: if the Spark options go on the gatk-launch command, then when the Cromwell task runs, it will itself spin up a new and separate job in Spark - entirely unknown to (and unmanaged by) Cromwell - which will get the memory, CPUs, etc. that it needs. I hope that made sense. If I misunderstood the question, please let me know!
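To make that concrete, here is a minimal sketch of that approach (the task name, inputs, and resource numbers are illustrative, not from this thread): the gatk-launch call inside the command block submits its own Spark job, and there are no Spark runtime attributes for Cromwell to manage.

task print_reads_spark_cluster {
  String input_bam
  String output_bam
  String master_url

  command {
    ./gatk-launch PrintReadsSpark \
        -I ${input_bam} -O ${output_bam} \
        -- \
        --sparkRunner SPARK --sparkMaster ${master_url} \
        --num-executors 5 --executor-cores 2 --executor-memory 4g
  }
  # No Spark runtime attributes here: the Spark job is spun up by
  # gatk-launch itself and is unknown to (and unmanaged by) Cromwell.
}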
I would tend to agree with my learned colleague -- with the caveat that I have not run on a local Spark cluster myself, my reflex would be to give those arguments directly to GATK. Of course there's no substitute for testing.
Thanks @ChrisL and @Geraldine_VdAuwera for the comments! gatk-launch calls spark-submit to submit jobs to our Spark+Yarn cluster. Those runtime attributes are from the Cromwell docs. Because the Cromwell config file also contains all the info needed to run spark-submit, does Cromwell itself actually run spark-submit and add the command block of the task definition to the spark-submit command line? If so, the command block should not call spark-submit again, i.e. one should use --sparkRunner LOCAL?
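For reference, the Spark backend stanza in the Cromwell configuration of that era looked roughly like the following; this is a sketch reconstructed from the Cromwell docs, and the exact keys and values are not confirmed anywhere in this thread.

backend {
  default = "Spark"
  providers {
    Spark {
      actor-factory = "cromwell.backend.impl.spark.SparkBackendFactory"
      config {
        # "yarn" with deployMode "cluster" targets a Spark+Yarn cluster;
        # "local" with "client" runs everything on the Cromwell host.
        master: "yarn"
        deployMode: "cluster"
      }
    }
  }
}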
Hi @blueskypy, the Spark backend was added by our Intel collaborators and we don't have a very good handle on how it works. We've asked one of the Intel devs to join this thread to help sort this out.
Hi @blueskypy, I am an engineer at Intel who contributed the Spark backend to Cromwell 11 months ago, with the help of another engineer. It is good news for us that someone else is looking into its usage. Jogging my memory on the implementation: much as the gatk-launch script runs spark-submit with all of its switches, the script that Cromwell creates internally encapsulates the WDL task command with the runtime attributes and calls spark-submit.
Hope that made sense. Let me know if you have any specific questions!
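In other words, the backend maps the runtime attributes onto spark-submit flags and hands it the generated script wrapping the task's command block. An illustrative reconstruction (not the literal invocation Cromwell produces; the path is a placeholder):

# Illustrative only; the script path is a placeholder, not Cromwell's layout.
# appMainClass   -> --class
# executorMemory -> --executor-memory
# executorCores  -> --executor-cores
spark-submit \
    --class ${entry_point} \
    --executor-memory 4G \
    --executor-cores 2 \
    /path/to/cromwell-execution-dir/script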
Thanks so much @DilliWala_Cric for the help! So Cromwell does call spark-submit. Just wondering whether you or @Geraldine_VdAuwera know how to run GATK4 Spark in that case? For example, should one use gatk-submit XXXSpark ... -- --sparkRunner LOCAL with the Cromwell Spark+Yarn backend?

@blueskypy, we have not tried to run GATK4 because there was not much requirement to cover that in the first deliverable. Does gatk-submit encapsulate spark-submit? If gatk-submit is the entry point in GATK4, then you can specify that in the command section, but then it would be spark-submit .... gatk-submit .... Or, instead of calling gatk-submit, perhaps call spark-submit directly?

Thanks @DilliWala_Cric! There was a typo in my last post - gatk-submit should be gatk-launch. Here is its doc on how to run the Spark version. That's why I think I should use the --sparkRunner LOCAL option; but I'm not sure what this in-memory Spark runner is and whether it works with the Cromwell Spark+Yarn backend.

Hi @blueskypy, were you able to do what you needed? Ultimately what it comes down to is that we haven't yet scoped out the interaction between Cromwell and GATK4 Spark functionality, so anything related to that is strictly experimental and unsupported -- but if you do figure it out, we'd love to hear how you did it so that we can help others who are in the same boat.
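For anyone experimenting along these lines, here is a minimal sketch of the workaround discussed above - untested, per the caveats - in which the GATK4 Spark tool uses its in-memory runner inside an ordinary (non-Spark-backend) Cromwell task; the task name and inputs are illustrative.

task print_reads_spark_local {
  String input_bam
  String output_bam

  command {
    ./gatk-launch PrintReadsSpark \
        -I ${input_bam} -O ${output_bam} \
        -- \
        --sparkRunner LOCAL
  }
  runtime {
    # Ordinary (non-Spark) runtime attributes: the Spark work runs
    # in-process on whatever node Cromwell assigns to this task.
    cpu: 2
    memory: "4G"
  }
}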
Thanks @Geraldine_VdAuwera! If this issue can be solved, I can go ahead and do more testing.
Ah right, ok I'll ask the devs to make sure to follow up there.