Cromwell (Java) out of memory, unable to create new native thread

I started experiencing issues with Cromwell shutting down with the error attached below. What does it mean, and how can I work around it?

Uncaught error from thread [cromwell-system-akka.dispatchers.backend-dispatcher-218]: unable to create new native thread, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[cromwell-system]
java.lang.OutOfMemoryError: unable to create new native thread

Answers

  • EADG (Kiel) Member ✭✭✭

    Hi @dplichta ,

    A short question: did you use the in-memory database in Cromwell? If so, you might consider switching to a separate database. Which Cromwell version do you use?

    Greetings EADG

  • dplichta Member

    Hi @EADG,

    I am using cromwell-32 and point to a locally run MySQL server for bookkeeping. It's a setup that I share with other members of the lab on a server with 24 CPUs and 250 GB RAM. We observed this issue when 3 separate instances of Cromwell were running (or trying to run) at the same time. I noticed that cromwell/java normally uses a lot of swap memory (from the top command: VIRT=45g, RES=2g), but I don't know if that is the issue.

    Damian

  • EADG (Kiel) Member ✭✭✭

    Hi @dplichta,

    Ok, I had to refresh my Java knowledge a little bit, but I think I have found an answer to your question.

    First of all, it is not a problem of insufficient memory; it is a limitation imposed by the OS, which can only spawn a limited number of threads. You can find a good description here:
    plumbr.io:unable-to-create-new-native-thread and here: stackoverflow.com:java-lang-outofmemoryerror-unable-to-create-new-native-thread
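
    To see the caps on your own machine, something like the following can help (the /proc path is Linux-specific; this is an illustrative sketch, not part of Cromwell):

```shell
# Per-user limit on processes/threads (may print "unlimited")
user_limit=$(ulimit -u)
echo "per-user process/thread limit: $user_limit"

# Kernel-wide thread cap (Linux only)
if [ -r /proc/sys/kernel/threads-max ]; then
  echo "kernel-wide thread cap: $(cat /proc/sys/kernel/threads-max)"
fi
```

    If each Cromwell instance holds hundreds of threads, three instances can hit the per-user limit well before physical memory runs out.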

    By default, Cromwell runs a lot of concurrent jobs. If you run 3 instances with workflows that contain many scatter/gather operations, you may exceed the thread limit of your system.

    A possible solution for this is to set a limit on concurrent jobs in your Cromwell configuration file (e.g. application.conf), like:

    backend.providers.Local.config.concurrent-job-limit = 100
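
    For clarity, the same setting in nested HOCON form (the Local provider name matches Cromwell's default config; substitute your backend's name and tune the limit to your host):

```hocon
backend {
  providers {
    Local {
      config {
        # Maximum number of jobs this Cromwell instance runs at once
        concurrent-job-limit = 100
      }
    }
  }
}
```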
    

    Hope this helps,
    Greetings EADG

  • dplichta Member

    Thank you EADG and ChrisL.

    It seems that the server mode is what we need.

    The thing that I liked about separate Cromwell runs was that I could use a different config file for different projects/users, with a different root bucket location in Google Cloud to which Cromwell writes ("Base bucket for workflow executions"). Can that setting be changed for different workflow executions? Especially with many lab members testing workflows, it will be difficult to keep track of who is producing which outputs. (This question touches on good Cromwell practice; if there is a good resource about it to explore, please share.)

    Damian

  • EADG (Kiel) Member ✭✭✭

    Hi @dplichta,

    Just declare output directories via the command line or an options.json, like:

    {
     "final_workflow_outputs_dir" : "Replace-ResultPath",
     "workflow-log-dir" : "Replace-LogPath",
     "final_call_logs_dir": "Replace-LogCallPath",
     "workflow-log-temporary" : "false"
    }
    
    

    And start your workflows like:

    curl -v "localhost:8000/api/workflows/v1" -F workflowSource=@workflow.wdl -F workflowInputs=@inputs.json -F workflowOptions=@options.json
    

    Cromwell will copy all results to the declared paths and will also keep a copy at the current execution location.
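
    With many samples, the submissions can be scripted instead of typed by hand. A sketch (the sample names, file names, and the default port 8000 are assumptions; this version only prints the commands so you can inspect them before running them for real):

```shell
# Generate one submission command per sample; each sample gets its own
# inputs file and, optionally, its own options.json with distinct paths.
submitted=0
for sample in sampleA sampleB sampleC; do
  echo curl -v "localhost:8000/api/workflows/v1" \
    -F workflowSource=@workflow.wdl \
    -F workflowInputs=@"${sample}.inputs.json" \
    -F workflowOptions=@"${sample}.options.json"
  submitted=$((submitted + 1))
done
echo "generated $submitted submissions"
```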

    Maybe it is a good idea, in this case, to limit the number of concurrent workflows to avoid overloading the server. To do so, just add these lines to your config:

    system {
      max-concurrent-workflows = 1
      new-workflow-poll-rate = 1
      max-workflow-launch-count = 1
      abort-jobs-on-terminate=true
    }
    

    Greetings EADG

  • dplichta Member

    Thanks @EADG for introducing me to "curl ..". I will test the system settings asap.

    Damian

  • blueskypy Member ✭✭

    Hi @ChrisL, I have exactly the same problem. But how do I run a Cromwell server? The tutorial shows submitting jobs through a webpage, but I can only ssh into an HPC cluster, and I have hundreds of samples, so I cannot submit each of them manually.
