Automation beyond cromwell

Background: I work in food safety. We type our isolates using MLST and then perform a SNP analysis within each sequence type to determine whether we have related isolates. For the SNP analysis, we also include retrospective isolates of that sequence type that we found previously.
I am looking for a way to automate this process. Right now, it's quite a lot of work to extract all isolates with a certain sequence type, get the appropriate reference, run wdltool inputs, make sure all settings are correct, and run Cromwell. Then, once we get new data, we have to do it all over again with one (or more) additional sample(s). Of course call caching helps with the computational time, but not with the 'manual' time we have to spend setting up each analysis.
Are there any tools available that can help queue up Cromwell runs for each MLST type? I imagine re-running a data set with one extra sample is pretty common.
Best Answer
kshakir ✭✭
Automating submissions to Cromwell is beyond the scope of these particular support forums. Still, in my limited experience, I've seen a couple of cases where custom software tools were built to submit to Cromwell's REST API when some upstream event happened.
Most of the tools I can think of poll some sort of message queue. The queue could be implemented with a shared database, but in the more scalable cases it was a message broker such as ActiveMQ. Some other tool places an event message on the queue, and a custom "cromwell-submitter" subscribes to the queue and polls it every few seconds. Once a basic message is received, the submitter may gather other information from the system environment (in your case perhaps the sequence type, reference, etc.) and may run other, limited pre-processing. Once the custom inputs JSON, workflow options, etc. are all built, the WDL is submitted to Cromwell through the REST API.
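To make the pattern above a bit more concrete, here is a minimal Python sketch of such a submitter, using only the standard library. It posts to Cromwell's `POST /api/workflows/v1` endpoint with the `workflowSource`, `workflowInputs` and `workflowOptions` form fields. Everything else is an assumption for illustration: the host and port in `CROMWELL_URL`, the in-process `queue.Queue` standing in for a real broker such as ActiveMQ, and the hypothetical helper names `build_multipart`, `submit_to_cromwell` and `drain_queue`.

```python
import json
import queue
import urllib.request
import uuid

# Assumed location of the Cromwell server; adjust to your own setup.
CROMWELL_URL = "http://localhost:8000/api/workflows/v1"


def build_multipart(fields):
    """Encode a dict of name -> text as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
            f"{value}\r\n"
        )
    chunks.append(f"--{boundary}--\r\n")
    body = "".join(chunks).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def submit_to_cromwell(wdl_text, inputs, options=None, url=CROMWELL_URL):
    """Submit one workflow over REST and return the workflow id."""
    fields = {"workflowSource": wdl_text, "workflowInputs": json.dumps(inputs)}
    if options:
        fields["workflowOptions"] = json.dumps(options)
    body, content_type = build_multipart(fields)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]


def drain_queue(events, build_inputs, submit, wdl_text):
    """Handle every message currently queued; return the submitted workflow ids.

    `events` is a queue.Queue here; a real broker client would take its place.
    `build_inputs` turns one message into the inputs dict for the workflow.
    """
    workflow_ids = []
    while True:
        try:
            message = events.get_nowait()
        except queue.Empty:
            return workflow_ids
        workflow_ids.append(submit(wdl_text, build_inputs(message)))
```

A long-running service would call `drain_queue(events, build_inputs, submit_to_cromwell, wdl_text)` in a loop with a short `time.sleep` between polls. Passing `submit` in as a parameter keeps the polling logic testable without a running Cromwell server.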
As you already stated, and to reinforce it for others coming across this thread: for "re-running a data set with one extra sample", I've sometimes seen folks rely on Cromwell's call caching. The entire previous dataset plus the new sample is auto-generated as a new inputs JSON, and it's left to the WDL/Cromwell to figure out that a number of the jobs have already been run. In that case the majority of results are just copied or linked, and only the "+ 1" job(s) are run.
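As a rough illustration of the "previous dataset plus one" idea, the sketch below generates the new inputs from the previous ones. The input names `snp_analysis.samples` and `snp_analysis.reference`, the file names, and the helper `plus_one_inputs` are all hypothetical; the real keys depend on your WDL. It also assumes call caching is enabled on the Cromwell server, since otherwise every job is simply run again.

```python
import copy
import json

# Hypothetical workflow input name; use whatever your WDL actually declares.
SAMPLES_KEY = "snp_analysis.samples"


def plus_one_inputs(previous_inputs, new_samples, key=SAMPLES_KEY):
    """Return a copy of the previous inputs with the new samples appended.

    Samples already present are skipped, so re-adding a sample is harmless.
    The previous inputs dict is left untouched.
    """
    updated = copy.deepcopy(previous_inputs)
    samples = list(updated.get(key, []))
    for sample in new_samples:
        if sample not in samples:
            samples.append(sample)
    updated[key] = samples
    return updated


previous = {
    "snp_analysis.reference": "ST9_reference.fasta",
    SAMPLES_KEY: ["isolate_001.fastq.gz", "isolate_002.fastq.gz"],
}
updated = plus_one_inputs(previous, ["isolate_003.fastq.gz"])

# This JSON text is what would be submitted as the workflow inputs.
print(json.dumps(updated, indent=2))
```

Because the inputs for the earlier samples are unchanged, the per-sample calls should be eligible for cache hits, and only the calls that depend on the new sample need to run.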
Hope this helps!
Answers
Hi @kshakir,
Thanks for your reply. I ended up creating a custom program that keeps track of all isolates and their MLST type, and fires off a new Cromwell analysis whenever it encounters a new isolate of a given type. That way, Cromwell keeps track of the jobs that have already been run, as you suggested.
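For anyone wanting to try something similar, the sketch below shows one way such bookkeeping might look. It is not the program described above, only an illustration: the class `IsolateTracker`, the function `process_batch` and the isolate names are hypothetical, and a real tool would persist its state (for example in a database) rather than keep it in memory.

```python
from collections import defaultdict


class IsolateTracker:
    """Remember which isolates have been seen for each MLST sequence type."""

    def __init__(self):
        self._by_type = defaultdict(list)

    def add(self, isolate, sequence_type):
        """Record an isolate; return True only if it is new for that type."""
        known = self._by_type[sequence_type]
        if isolate in known:
            return False
        known.append(isolate)
        return True

    def isolates(self, sequence_type):
        """Return a copy of all isolates recorded for a sequence type."""
        return list(self._by_type.get(sequence_type, []))


def process_batch(tracker, typed_isolates):
    """Return {sequence_type: all isolates} for types that gained an isolate.

    Each entry in the result describes one analysis to (re)submit, covering
    the full set of isolates of that type, old and new.
    """
    changed = []
    for isolate, sequence_type in typed_isolates:
        if tracker.add(isolate, sequence_type) and sequence_type not in changed:
            changed.append(sequence_type)
    return {st: tracker.isolates(st) for st in changed}


tracker = IsolateTracker()
first = process_batch(tracker, [("iso1", "ST9"), ("iso2", "ST9"), ("iso3", "ST121")])
second = process_batch(tracker, [("iso4", "ST9"), ("iso3", "ST121")])
```

In this example `second` only contains ST9, with all three of its isolates, because ST121 gained nothing new. Each returned entry could then be turned into an inputs JSON and submitted to Cromwell.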