GATK4 Spark with WDL

blueskypy · Member
edited August 2017 in Ask the WDL team

GATK4 lets you specify Spark options on the gatk-launch command line, e.g.

./gatk-launch PrintReadsSpark \
  -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
  -- \
  --sparkRunner SPARK --sparkMaster <master_url> \
  --num-executors 5 --executor-cores 2 --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=600

However, some of these options can also be specified in the runtime block of a WDL task.

task sparkjob_with_yarn_cluster {
    ....
    runtime {
        appMainClass: "${entry_point}"
        executorMemory: "4G"
        executorCores: "2"
    }
}

So should those options be specified in both places or just one of them?


Issue · GitHub #2409, opened by Geraldine_VdAuwera, state: closed (closed by vdauwera)

Answers

  • ChrisL, Cambridge, MA · Member, Broadie, Moderator, Dev

    Admittedly I don't know a huge amount about gatk-launch or which backend you're using to run these jobs, but I would guess you want to use the former.

    • If you include the options in the gatk-launch command, then when the Cromwell task runs, it will itself spin up a new and separate Spark job - entirely unknown to (and unmanaged by) Cromwell - which will get the memory, CPUs, etc. that it needs (see the sketch after this list).
    • If you include these flags in the runtime block (and if the backend that you're using supports them - and FWIW I haven't seen those attribute names before...), then the task that calls into Spark will get all of the resources, but Spark itself will never see the requirements.
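
    For illustration only, here is a minimal sketch of the former approach, i.e. passing the Spark options straight to gatk-launch inside the task's command block. The task and variable names are made up for the example, and whether the memory / cpu runtime attributes apply depends on the backend you use:

    task print_reads_spark {
        String input_bam      # e.g. an hdfs:// path
        String output_bam
        String master_url

        command {
            ./gatk-launch PrintReadsSpark \
                -I ${input_bam} -O ${output_bam} \
                -- \
                --sparkRunner SPARK --sparkMaster ${master_url} \
                --num-executors 5 --executor-cores 2 --executor-memory 4g
        }

        runtime {
            # Only what the submitting process itself needs; the Spark
            # executors get their resources from the gatk-launch options above.
            memory: "2 GB"
            cpu: 1
        }
    }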

    I hope that made sense. If I misunderstood the question please let me know!

  • Geraldine_VdAuwera, Cambridge, MA · Member, Administrator, Broadie

    I would tend to agree with my learned colleague, with the caveat that I have not run on a local Spark cluster myself; my reflex would be to give those arguments directly to GATK. Of course there's no substitute for testing...

  • Thanks @ChrisL and @Geraldine_VdAuwera for the comments! gatk-launch calls spark-submit to submit jobs to our Spark+Yarn cluster. Those runtime attributes are from the Cromwell docs.

  • Since the Cromwell config file also contains all the information needed to run spark-submit, does Cromwell itself actually run spark-submit, appending the command block from the task definition to the spark-submit command line? If so, the command block should not call spark-submit again, i.e. should one use --sparkRunner LOCAL?

  • Geraldine_VdAuwera, Cambridge, MA · Member, Administrator, Broadie

    Hi @blueskypy, the Spark backend was added by our Intel collaborators and we don't have a very good handle on how it works. We've asked one of the Intel devs to join this thread to help sort this out.

  • Hi @blueskypy, I am an engineer at Intel Corporation who contributed the Spark backend to Cromwell 11 months ago, with the help of another engineer. It is good news for us that someone else is looking into its usage. Jogging my memory on the implementation: much like the gatk-launch script runs spark-submit with all the switches, Cromwell internally generates a script that wraps the WDL task command together with the runtime attributes and calls spark-submit.
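
    To make that concrete, here is a rough, hypothetical sketch of the kind of spark-submit invocation such a generated script ends up running, using the runtime attributes shown earlier in this thread; the exact shape of the real generated script may differ:

        spark-submit \
            --class <value of appMainClass> \
            --executor-memory 4G \
            --executor-cores 2 \
            <application jar> <arguments built from the task's command section>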

    Hope that made sense. Let me know if you have any specific questions!

  • Thanks so much @DilliWala_Cric for the help! So Cromwell does call spark-submit. I just wonder whether you or @Geraldine_VdAuwera knows how to run GATK4 Spark in that case. For example, should one use gatk-submit XXXSpark ... -- --sparkRunner LOCAL with the Cromwell Spark+Yarn backend?

  • @blueskypy, we have not tried to run GATK4, because there was not much requirement to cover it in the first deliverable. Does gatk-submit encapsulate spark-submit? If gatk-submit is the entry point in GATK4, then you can specify it in the command section, but the invocation would then end up as spark-submit .... gatk-submit .... Or, instead of calling gatk-submit, perhaps call spark-submit directly?
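
    (As a purely illustrative sketch of that last suggestion: the jar name and main class below are assumptions about the GATK4 distribution, so please verify them against your install.)

        spark-submit \
            --master <master_url> \
            --executor-memory 4g --executor-cores 2 \
            --class org.broadinstitute.hellbender.Main \
            /path/to/gatk-package-spark.jar \
            PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam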

  • blueskypy · Member
    edited August 2017

    Thanks @DilliWala_Cric! There was a typo in my last post - gatk-submit should be gatk-launch. Here is its doc on how to run the Spark version:

    gatk-launch forwards commands to GATK and adds some sugar for submitting spark jobs
    --sparkRunner controls how spark tools are run
    valid targets are:
    LOCAL: run using the in-memory spark runner
    SPARK: run using spark-submit on an existing cluster

    That's why I think I should use the --sparkRunner LOCAL option; but I'm not sure what this in-memory Spark runner is, or whether it works with the Cromwell Spark+Yarn backend.
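
    To make the two modes from that doc concrete, command-line sketches (the paths and master URL are placeholders):

        # LOCAL: uses the in-memory runner, no external cluster involved
        ./gatk-launch PrintReadsSpark -I input.bam -O output.bam -- --sparkRunner LOCAL

        # SPARK: hands the job to spark-submit on an existing cluster
        ./gatk-launch PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
            -- --sparkRunner SPARK --sparkMaster <master_url>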

  • Geraldine_VdAuwera, Cambridge, MA · Member, Administrator, Broadie

    Hi @blueskypy, were you able to do what you needed? Ultimately what it comes down to is that we haven't yet scoped out the interaction between Cromwell and GATK4's Spark functionality, so anything in that area is strictly experimental and unsupported -- but if you do figure it out, we'd love to hear how you did it so that we can help others who are in the same boat.

  • Thanks @Geraldine_VdAuwera! If this issue can be solved, I can go ahead and do more testing.

  • Geraldine_VdAuwera, Cambridge, MA · Member, Administrator, Broadie

    Ah right, OK, I'll ask the devs to make sure to follow up there.
