
Disadvantages of using the Google Genomics Pipelines API directly


I’m very new to workflow management tools, and I’m trying to understand conceptually what value Cromwell/WDL adds over using the Google Genomics Pipelines API directly, *assuming* one is only interested in using GCP for the actual computation.

From what I understand, even if the ability to run on various backends is not of interest, Cromwell still helps with tracking multiple parallel jobs (as dsub does) and adds the scatter functionality. I wonder whether there are further advantages to using Cromwell rather than the Pipelines API. Are there things that are much harder to achieve when working with the Pipelines API directly?

Apologies in advance for the ignorant question, just trying to understand the conceptual differences and added functionalities.



  • jgentry (Member, Broadie, Dev)

    Hi -

    It's not an ignorant question at all.

    At a high level, IMO there is value in using tools like dsub and Cromwell when interacting with the PAPI (Pipelines API), as these can recognize common error cases and other edge cases and react appropriately, for instance by retrying on transient errors.
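
    To make "retrying on transient errors" concrete, here is a minimal Python sketch of the kind of logic you would otherwise write yourself around raw API calls. It is not how dsub or Cromwell implement it; `TransientError`, `run_with_retries` and the parameters are hypothetical names used only for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a preempted VM)."""


def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call job() until it succeeds or max_attempts is exhausted.

    Only TransientError is retried; any other exception propagates at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Exponential backoff (1x, 2x, 4x, ...) plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

    Here `job` stands for whatever submits the work and waits for it. Deciding which failures count as transient is the part that tools like dsub and Cromwell take off your hands.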

    Beyond that, I'd suggest it depends on what you're trying to do:

    • If you're just launching individual jobs for yourself (i.e. you're not sharing with others), and are either running them as one-offs or are comfortable writing a little shell script, I'd suggest using dsub (or calling the API directly)
    • If you're just running simple pipelines (e.g. a linear chain of tasks), and are either running them as one-offs or are comfortable writing a little shell script, I'd suggest using dsub
    • If you're building more complicated DAG structures in your workflows, looking to share your workflows, looking to use other people's workflows, or looking to run them in other environments (I know you said this wasn't an issue), I'd recommend using Cromwell.
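
    As a rough illustration of why "more complicated DAG structures" call for an orchestrator, the sketch below uses Python's standard `graphlib` module (Python 3.9+) to work out which tasks may run in parallel at each stage. The task names and the `parallel_batches` helper are made up for the example; Cromwell derives this ordering from the WDL itself, so this only shows the bookkeeping you would be taking on by hand.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group tasks into batches whose members can all run at the same time.

    dependencies maps each task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorter.get_ready()   # tasks whose dependencies are all done
        batches.append(sorted(ready))
        sorter.done(*ready)          # pretend the whole batch finished
    return batches


# Hypothetical workflow: two tasks fan out from "align" and join in "report".
deps = {
    "align": set(),
    "call_variants": {"align"},
    "qc": {"align"},
    "report": {"call_variants", "qc"},
}
batches = parallel_batches(deps)
```

    A linear chain produces one task per batch, which is why a small shell script (or dsub) is enough there; the fan-out and join steps are where hand-rolled scripts start to get fiddly.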

    You can probably pick it up from the above, but the tradeoff axes work out to be roughly "more power & flexibility" vs "more complexity". The scenarios I described above are increasing in both power and/or need of flexibility, but each one introduces a higher degree of complexity. Just as dsub abstracts away a certain amount of functionality in PAPI, Cromwell adds a higher level of orchestration & abstraction, but this comes at the cost of needing to understand more tooling, etc.

    Does this make sense?


  • taliraveh (Israel, Member)
    Thanks for the detailed response! It helps A LOT.

    Just to clarify, our use case is one where we will be launching a lot of jobs (so retrying and picking up from the point of failure are important), but only 2-3 linear, relatively straightforward pipelines that won't be shared or run on other platforms. We are very comfortable with coding, so, from what I can gather, something like dsub can be very useful for keeping track of jobs and restarting failed ones, but we won't gain much from the added functionality that Cromwell offers, which mostly relates to running across platforms or handling complicated logic.
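
    For a linear pipeline, "picking up from the point of failure" can be sketched in a few lines of Python. This is only an assumption about how one might script it; the names (`run_pipeline`, `steps`, `completed`) are hypothetical, and a real setup would persist `completed` somewhere durable, for example as marker files next to the outputs.

```python
def run_pipeline(sample, steps, completed):
    """Run a linear chain of steps for one sample, skipping finished ones.

    steps is a list of (name, func) pairs executed in order.
    completed is a set of (sample, name) pairs recorded by earlier runs.
    """
    for name, func in steps:
        if (sample, name) in completed:
            continue  # already done in an earlier run: resume after it
        func(sample)  # if this raises, the step is not marked as completed
        completed.add((sample, name))
```

    Calling `run_pipeline` again after a failure re-runs only the failed step and everything after it, as long as the same `completed` set is passed in.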

    Please correct me if I am wrong here. I really appreciate this, it really helps in clarifying the bigger picture.