Is this error caused by a job submission failure?

mmah Member, Broadie

I am encountering an error with Cromwell v26 running on LSF and SLURM backends in standalone mode. The error is not consistently reproducible, and I believe it may be related to starting too many jobs too quickly during a scatter operation and hitting job submission failures. I plan to address this with the concurrent-job-limit configuration, but I am looking for information on whether there are other possible causes as well.
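
For reference, this is the setting I mean, sketched as a minimal backend stanza (the value 50 is only an illustrative placeholder, and the provider name depends on your own configuration):

    backend {
      providers {
        SLURM {
          config {
            # Upper bound on how many jobs Cromwell will keep queued/running on this
            # backend at once; 50 is just an illustrative placeholder value.
            concurrent-job-limit = 50
          }
        }
      }
    }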

This is from Cromwell's standard output:

[ERROR] [05/12/2017 11:52:56.464] [cromwell-system-akka.dispatchers.backend-dispatcher-391] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-6f436d30-39ab-454d-8a98-47f007976161/WorkflowExecutionActor-6f436d30-39ab-454d-8a98-47f007976161/6f436d30-39ab-454d-8a98-47f007976161-EngineJobExecutionActor-ancientDNA_screen.process_sample_hs37d5:58:1/6f436d30-39ab-454d-8a98-47f007976161-BackendJobExecutionActor-6f436d30:ancientDNA_screen.process_sample_hs37d5:58:1/DispatchedConfigAsyncJobExecutionActor] DispatchedConfigAsyncJobExecutionActor [UUID(6f436d30)ancientDNA_screen.process_sample_hs37d5:58:1]: Error attempting to Execute
java.lang.NullPointerException
    at cromwell.backend.standard.StandardAsyncExecutionActor$class.ec(StandardAsyncExecutionActor.scala:695)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.ec(ConfigAsyncJobExecutionActor.scala:121)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.ec(ConfigAsyncJobExecutionActor.scala:121)
    at cromwell.backend.standard.StandardAsyncExecutionActor$class.tellKvJobId(StandardAsyncExecutionActor.scala:682)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.tellKvJobId(ConfigAsyncJobExecutionActor.scala:121)
    at cromwell.backend.standard.StandardAsyncExecutionActor$class.cromwell$backend$standard$StandardAsyncExecutionActor$$executeOrRecoverSuccess(StandardAsyncExecutionActor.scala:532)
    at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$executeOrRecover$2.apply(StandardAsyncExecutionActor.scala:521)
    at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$executeOrRecover$2.apply(StandardAsyncExecutionActor.scala:521)
    at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:253)
    at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:251)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Answers

  • jgentry Member, Broadie, Dev

    Hi @mmah - we've definitely seen that error before. Interestingly, it happens due to a situation that the creators of the library we're using there say is impossible, but clearly it is not :)

    I don't remember whether this was tracked down and resolved; I'll investigate. If not, I'll open an issue or attach this to an existing one, as appropriate.

  • alongalor Member
    edited May 8

    I am experiencing a similar issue using Cromwell (v31) on SLURM. If I run the GATK4 pre-processing WDL (https://github.com/gatk-workflows/gatk4-data-processing) on ~3-4 samples simultaneously, everything works perfectly, but when I run 10+ samples simultaneously and many Cromwell jobs are submitted at the same time, many of the samples fail. Any help would be much appreciated.

  • mmah Member, Broadie

    @alongalor The problem I sometimes encounter with SLURM is that a job submission to the SLURM scheduler fails due to a socket timeout when many submissions are made in a short time. Cromwell correctly detects this as an error and starts cleanup of the workflow. Subsequent errors then look like the one I described in the original question.

    My understanding is that Cromwell 31 has a feature to limit the rate of job submissions, which should help avoid overloading the SLURM scheduler:

    The rate at which jobs are being started can now be controlled using the system.job-rate-control configuration stanza.
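
    Based on that note, I would expect the stanza to look something like the sketch below; the keys and numbers are my guesses/placeholders, so check the cromwell.examples.conf shipped with your version:

    system {
      # Throttle job starts: at most `jobs` starts per `per` interval.
      # 20 per 10 seconds is an illustrative placeholder, not a tuned recommendation.
      job-rate-control {
        jobs = 20
        per = 10 seconds
      }
    }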

    I have not personally used Cromwell 31 yet, but I would suggest you try this feature. Someone on the Cromwell team may be able to give you more ideas.

  • alongalor Member

    @mmah thank you so much for your swift reply! That is exactly the problem I am currently facing! So far, after reading the updated Cromwell documentation and scouring the cromwell.examples.conf file, I have updated my overrides.conf file to read as shown below, but unfortunately I am still experiencing the same errors (a fairly representative one is listed below). Any help from the Cromwell team, or any advice you might have, would be extremely appreciated!

    system {
      # Number of seconds between workflow launches
      new-workflow-poll-rate = 60
      # Cromwell will launch up to N submitted workflows at a time, regardless of how many open workflow slots exist
      max-workflow-launch-count = 1
      io {
        # Number of times an I/O operation should be attempted before giving up and failing it.
        number-of-attempts = 10
      }
    }
    
    services {
      LoadController {
        config {
          control-frequency = 5 seconds
        }
      }
    
      HealthMonitor {
        config {
          # For any given status check, how many times to retry a failure before setting status to failed. Note this
          # is the number of retries before declaring failure, not the total number of tries which is 1 more than
          # the number of retries.
          check-failure-retry-count = 50
          # For any given status check, how long to wait between failure retries.
          check-failure-retry-interval = 30 seconds
        }
      }
    }
    
    backend {
      # Override the default backend.
      default = "SLURM"
    
      # The list of providers.
      providers {
    
        SLURM {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {        
    
            concurrent-job-limit = 1000
    
            runtime-attributes = """
            Int runtime_minutes = 36000
            Int cpus = 1
            String requested_memory_mb_per_core = "8000"
            String queue = "park"
            String account_name = "park_contrib"
            """
    
            submit = """
                sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} -t ${runtime_minutes} -p ${queue} \
                ${"-n " + cpus} ${"[email protected]"} \
                --mem-per-cpu=${requested_memory_mb_per_core} \
                --account=${account_name} \
                --wrap "/bin/bash ${script}"
            """
            kill = "scancel ${job_id}"
            check-alive = "squeue -j ${job_id}"
            job-id-regex = "Submitted batch job (\\d+).*"
          }
        }
    
      }
    }
    

    A typical error (attaching the respective output log):

    [2018-05-10 00:06:24,38] [info] DispatchedConfigAsyncJobExecutionActor [4dd7a539PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1]: java -Dsamjdk.compression_level=5 -Xms4000m -Xmx6000m -jar /n/data1/hms/dbmi/park/alon/software/picard.jar \
      MarkDuplicates \
      INPUT=/n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates/inputs/n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MergeBamAlignment/shard-0/execution/SRR475154.aligned.unsorted.bam \
      OUTPUT=SRR475154.hg38.aligned.unsorted.duplicates_marked.bam \
      METRICS_FILE=SRR475154.hg38.duplicate_metrics \
      VALIDATION_STRINGENCY=SILENT \
      OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
      ASSUME_SORT_ORDER="queryname"
      CREATE_MD5_FILE=true
    [2018-05-10 00:06:24,40] [info] DispatchedConfigAsyncJobExecutionActor [4dd7a539PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1]: executing: sbatch -J cromwell_4dd7a539_MarkDuplicates -D /n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates -o /n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates/execution/stdout -e /n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates/execution/stderr -t 36000 -p park \
    -n 4 [email protected] \
    --mem-per-cpu=7000 \
    --account=park_contrib \
    --wrap "/bin/bash /n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates/execution/script"
    [2018-05-10 00:06:26,10] [info] DispatchedConfigAsyncJobExecutionActor [4dd7a539PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1]: job id: 13954991
    [2018-05-10 00:06:26,12] [info] DispatchedConfigAsyncJobExecutionActor [4dd7a539PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1]: Status change from - to WaitingForReturnCodeFile
    Uncaught error from thread [cromwell-system-akka.dispatchers.backend-dispatcher-487]: unable to create new native thread, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[cromwell-system]
    java.lang.OutOfMemoryError: unable to create new native thread
            at java.lang.Thread.start0(Native Method)
            at java.lang.Thread.start(Thread.java:714)
            at akka.dispatch.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672)
            at akka.dispatch.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966)
            at akka.dispatch.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829)
            at akka.dispatch.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(ForkJoinExecutorConfigurator.scala:29)
            at akka.dispatch.ExecutorServiceDelegate.execute(ThreadPoolBuilder.scala:211)
            at akka.dispatch.ExecutorServiceDelegate.execute$(ThreadPoolBuilder.scala:211)
            at akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:39)
            at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:115)
            at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:55)
            at akka.actor.dungeon.Dispatch.sendMessage(Dispatch.scala:136)
            at akka.actor.dungeon.Dispatch.sendMessage$(Dispatch.scala:130)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.Cell.sendMessage(ActorCell.scala:291)
            at akka.actor.Cell.sendMessage$(ActorCell.scala:290)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.Cell.sendMessage(ActorCell.scala:291)
            at akka.actor.Cell.sendMessage$(ActorCell.scala:290)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.LocalActorRef.$bang(ActorRef.scala:400)
            at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustPoll$2(AsyncBackendJobExecutionActor.scala:77)
            at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustPoll$2$adapted(AsyncBackendJobExecutionActor.scala:76)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
            at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
            at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
            at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
            at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
            at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    [2018-05-10 00:53:45,59] [error] Uncaught error from thread [cromwell-system-akka.dispatchers.backend-dispatcher-487]: unable to create new native thread, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[cromwell-system]
    java.lang.OutOfMemoryError: unable to create new native thread
            at java.lang.Thread.start0(Native Method)
            at java.lang.Thread.start(Thread.java:714)
            at akka.dispatch.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672)
            at akka.dispatch.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966)
            at akka.dispatch.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829)
            at akka.dispatch.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(ForkJoinExecutorConfigurator.scala:29)
            at akka.dispatch.ExecutorServiceDelegate.execute(ThreadPoolBuilder.scala:211)
            at akka.dispatch.ExecutorServiceDelegate.execute$(ThreadPoolBuilder.scala:211)
            at akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:39)
            at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:115)
            at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:55)
            at akka.actor.dungeon.Dispatch.sendMessage(Dispatch.scala:136)
            at akka.actor.dungeon.Dispatch.sendMessage$(Dispatch.scala:130)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.Cell.sendMessage(ActorCell.scala:291)
            at akka.actor.Cell.sendMessage$(ActorCell.scala:290)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.LocalActorRef.$bang(ActorRef.scala:400)
            at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustPoll$2(AsyncBackendJobExecutionActor.scala:77)
            at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustPoll$2$adapted(AsyncBackendJobExecutionActor.scala:76)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
            at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
            at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
            at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
            at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
            at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    

    Thanks a lot,

    Alon

  • mmah Member, Broadie

    @alongalor Your error is not the same.

    java.lang.OutOfMemoryError: unable to create new native thread

    To me, this looks like you are not allocating enough memory for the process running Cromwell.
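
    A quick way to check would be to look at the limits on the host where you launch Cromwell and give the JVM more headroom. This is only a sketch: the jar name, workflow file, heap size, and stack size are placeholders, and the run flags may differ by Cromwell version.

    # Check the per-user limits that constrain how many native threads the JVM can create.
    ulimit -u    # max user processes/threads
    ulimit -v    # max virtual memory

    # Give the Cromwell JVM more heap and a smaller per-thread stack so more threads fit;
    # the -Xmx/-Xss values and the jar/workflow names are placeholders.
    java -Xmx8g -Xss512k -jar cromwell-31.jar run workflow.wdl --inputs inputs.json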

  • alongalor Member
    edited May 17

    This was my first instinct as well; however, just by modifying the parameters in my overrides.conf file, while keeping the input data constant, I was able to get this to work!

    Here is my updated overrides.conf file in case you're interested!

    # This line is required. It pulls in default overrides from the embedded cromwell `application.conf` needed for proper
    # performance of cromwell.
    include required("application")
    
    system { 
      # If 'true' then when Cromwell starts up, it tries to restart incomplete workflows
      #workflow-restart = true
      # Max number of retries per job that the engine will attempt in case of a retryable failure received from the backend
      max-retries = 50
      # Number of seconds between workflow launches
      new-workflow-poll-rate = 60
      max-workflow-launch-count = 1
      io {
        # Global Throttling - This is mostly useful for GCS and can be adjusted to match
        # the quota available on the GCS API
        #number-of-requests = 100000
        #per = 100 seconds

        # Number of times an I/O operation should be attempted before giving up and failing it.
        number-of-attempts = 10
      }
    }
    
    services {
      LoadController {
        config {
          control-frequency = 5 seconds
        }
      }
    
      HealthMonitor {
        config {
          # How long to wait between status check sweeps
          # check-refresh-time = 5 minutes
          # For any given status check, how long to wait before assuming failure
          # check-timeout = 1 minute
          # For any given status datum, the maximum time a value will be kept before reverting back to "Unknown"
          # status-ttl = 15 minutes
          # For any given status check, how many times to retry a failure before setting status to failed. Note this
          # is the number of retries before declaring failure, not the total number of tries which is 1 more than
          # the number of retries.
          check-failure-retry-count = 50
          # For any given status check, how long to wait between failure retries.
          check-failure-retry-interval = 30 seconds
        }
      }
    }
    
    backend {
      # Override the default backend.
      default = "SLURM"
    
      # The list of providers.
      providers {
    
        SLURM {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {        
    
            concurrent-job-limit = 1000
    
            runtime-attributes = """
            Int runtime_minutes = 36000
            Int cpus = 1
            String requested_memory_mb_per_core = "8000"
            String queue = "park"
            String account_name = "park_contrib"
            """
    
            submit = """
                sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} -t ${runtime_minutes} -p ${queue} \
                ${"-n " + cpus} ${"--mail-u[email protected]"} \
                --mem-per-cpu=${requested_memory_mb_per_core} \
                --account=${account_name} \
                --wrap "/bin/bash ${script}"
            """
            kill = "scancel ${job_id}"
            check-alive = "squeue -j ${job_id}"
            job-id-regex = "Submitted batch job (\\d+).*"
          }
        }
    
      }
    }
    
    