
GATK4 Spark with WDL

blueskypy Member ✭✭

GATK4 lets you specify Spark options on the gatk-launch command line, e.g.

./gatk-launch PrintReadsSpark \
  -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
  -- \
  --sparkRunner SPARK --sparkMaster <master_url> \
  --num-executors 5 --executor-cores 2 --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=600

However, some of these options can also be specified in the runtime block of a WDL task:

task sparkjob_with_yarn_cluster {
        runtime {
                appMainClass: "${entry_point}"
                executorMemory: "4G"
                executorCores: "2"
        }
}
So should those options be specified in both places, or just one of them?



  • ChrisL (Cambridge, MA) Member, Broadie, Dev admin

    Admittedly I don't know a huge amount about gatk-launch or which backend you're using to run these jobs, but I would guess you want to use the former.

    • If you include them in the gatk-launch command, then when the Cromwell task runs it will itself spin up a new and separate job in Spark - entirely unknown to (and unmanaged by) Cromwell - which will get the memory, CPUs, etc. that it needs.
    • If you include these flags in the runtime block (and if the backend you're using supports them - and FWIW I haven't seen those option names before...) - then the task that calls into Spark will get all of the resources, but Spark itself will never see the requirements.

    I hope that made sense. If I misunderstood the question please let me know!
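
    For illustration, a minimal WDL sketch of the first option (untested; the input names, output path, and resource values are placeholders) - the Spark flags sit in the task's command block, so Spark itself sees the resource requests:

```wdl
task print_reads_spark {
    File input_bam
    String master_url

    command {
        # gatk-launch submits its own Spark job; Cromwell just runs this command
        ./gatk-launch PrintReadsSpark \
            -I ${input_bam} -O output.bam \
            -- \
            --sparkRunner SPARK --sparkMaster ${master_url} \
            --num-executors 5 --executor-cores 2 --executor-memory 4g
    }
    output {
        File out = "output.bam"
    }
}
```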

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    I would tend to agree with my learned colleague -- with the caveat that I have not run on a local Spark cluster myself, my reflex would be to give those arguments directly to GATK. Of course there's no substitute for testing...

  • blueskypy Member ✭✭

    Thanks @ChrisL and @Geraldine_VdAuwera for the comments! gatk-launch calls spark-submit to submit jobs to our Spark+Yarn cluster. Those runtime attributes are from the Cromwell docs.

  • blueskypy Member ✭✭

    Because the Cromwell config file also contains all the info needed to run spark-submit, does Cromwell itself actually run spark-submit, appending the command block from the task definition to the spark-submit command line? If so, the command block should not call spark-submit again, i.e. one should use --sparkRunner LOCAL?
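
    For reference, the Spark backend stanza in the Cromwell config looks roughly like this (key names recalled from the Cromwell docs - treat the exact structure as approximate):

```hocon
backend {
  default = "Spark"
  providers {
    Spark {
      actor-factory = "cromwell.backend.impl.spark.SparkBackendFactory"
      config {
        # Cluster manager and deploy mode handed to spark-submit
        master = "yarn"
        deployMode = "client"
      }
    }
  }
}
```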

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hi @blueskypy, the Spark backend was added by our Intel collaborators and we don't have a very good handle on how it works. We've asked one of the Intel devs to join this thread to help sort this out.

  • Hi @blueskypy, I am an engineer at Intel Corporation who contributed the Spark backend to Cromwell 11 months ago, with the help of another engineer. It is good news for us that someone else is looking into its usage. Jogging my memory on the implementation: just as the gatk-launch script runs spark-submit with all the switches, Cromwell internally generates a script that encapsulates the WDL task's command together with its runtime attributes, and invokes spark-submit on it.

    Hope that made sense. Let me know if you have any specific questions!
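
    So the generated script ends up invoking something along these lines (illustrative only - the exact flags and the application jar name are placeholders that depend on the backend config and the task's runtime attributes):

```shell
spark-submit \
  --master yarn --deploy-mode client \
  --executor-memory 4G --executor-cores 2 \
  --class ${entry_point} \
  app.jar   # placeholder; the real invocation wraps the task's command block
```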

  • blueskypy Member ✭✭

    Thanks so much @DilliWala_Cric for the help! So Cromwell does call spark-submit. I just wonder if you or @Geraldine_VdAuwera know how to run GATK4 Spark in that case? For example, should one use gatk-submit XXXSpark ... -- --sparkRunner LOCAL with the Cromwell Spark+Yarn backend?

  • @blueskypy, we have not tried to run GATK4 because there was not much requirement to cover it in the first deliverable. Does gatk-submit encapsulate spark-submit? If gatk-submit is the entry point in GATK4, then you can specify it in the command section, but then it would be spark-submit .... gatk-submit .... Or, instead of calling gatk-submit, perhaps call spark-submit directly?

  • blueskypy Member ✭✭
    edited August 2017

    Thanks @DilliWala_Cric! There was a typo in my last post - gatk-submit should be gatk-launch. Here is its doc on how to run the Spark version:

    gatk-launch forwards commands to GATK and adds some sugar for submitting spark jobs
    --sparkRunner controls how spark tools are run
    valid targets are:
    LOCAL: run using the in-memory spark runner
    SPARK: run using spark-submit on an existing cluster

    That's why I think I should use the --sparkRunner LOCAL option; but I'm not sure what this in-memory spark runner is, or whether it works with the Cromwell Spark+Yarn backend.
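
    If that reading is right, the task would look something like this under the Spark+Yarn backend (an untested guess - the outer spark-submit comes from Cromwell, and gatk-launch runs the tool in-process rather than submitting a second job):

```wdl
task print_reads_spark_local_runner {
    File input_bam

    command {
        # Cromwell's Spark backend already wraps this command in spark-submit,
        # so gatk-launch should not submit a second Spark job itself
        ./gatk-launch PrintReadsSpark \
            -I ${input_bam} -O output.bam \
            -- --sparkRunner LOCAL
    }
    output {
        File out = "output.bam"
    }
}
```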

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hi @blueskypy, were you able to do what you needed? Ultimately what it comes down to is that we haven't yet scoped out the interaction between Cromwell and GATK4 spark functionality, so any functionality related to that is strictly experimental and unsupported -- but if you do figure it out we'd love to hear about how you did it so that we can help others who are in the same boat.

  • blueskypy Member ✭✭

    Thanks @Geraldine_VdAuwera! If this issue can be solved, I can go ahead and do more testing.

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Ah right, ok I'll ask the devs to make sure to follow up there.
