Is this error caused by a job submission failure?

I am encountering an error with Cromwell v26 running on LSF and SLURM backends in standalone mode. The error is not consistently reproducible, and I believe it may be related to starting too many jobs too quickly during a scatter operation and hitting job submission failures. I plan to address this with the concurrent-job-limit configuration (see the sketch below), but I am looking for information on whether there are other possible causes as well.
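
For reference, this is roughly the setting I have in mind for each backend provider (a minimal sketch; the backend name and the limit value here are placeholders, not my actual configuration):

backend {
  providers {
    LSF {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # Cap how many jobs Cromwell will have running on this backend at once.
        # The value is only an illustration.
        concurrent-job-limit = 50
      }
    }
  }
}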

This is from Cromwell's standard output:

[ERROR] [05/12/2017 11:52:56.464] [cromwell-system-akka.dispatchers.backend-dispatcher-391] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor/WorkflowActor-6f436d30-39ab-454d-8a98-47f007976161/WorkflowExecutionActor-6f436d30-39ab-454d-8a98-47f007976161/6f436d30-39ab-454d-8a98-47f007976161-EngineJobExecutionActor-ancientDNA_screen.process_sample_hs37d5:58:1/6f436d30-39ab-454d-8a98-47f007976161-BackendJobExecutionActor-6f436d30:ancientDNA_screen.process_sample_hs37d5:58:1/DispatchedConfigAsyncJobExecutionActor] DispatchedConfigAsyncJobExecutionActor [UUID(6f436d30)ancientDNA_screen.process_sample_hs37d5:58:1]: Error attempting to Execute
java.lang.NullPointerException
    at cromwell.backend.standard.StandardAsyncExecutionActor$class.ec(StandardAsyncExecutionActor.scala:695)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.ec(ConfigAsyncJobExecutionActor.scala:121)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.ec(ConfigAsyncJobExecutionActor.scala:121)
    at cromwell.backend.standard.StandardAsyncExecutionActor$class.tellKvJobId(StandardAsyncExecutionActor.scala:682)
    at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.tellKvJobId(ConfigAsyncJobExecutionActor.scala:121)
    at cromwell.backend.standard.StandardAsyncExecutionActor$class.cromwell$backend$standard$StandardAsyncExecutionActor$$executeOrRecoverSuccess(StandardAsyncExecutionActor.scala:532)
    at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$executeOrRecover$2.apply(StandardAsyncExecutionActor.scala:521)
    at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$executeOrRecover$2.apply(StandardAsyncExecutionActor.scala:521)
    at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:253)
    at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:251)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Answers

  • jgentry Member, Broadie, Dev

    Hi @mmah - we've definitely seen that error before. Interestingly, it happens due to a situation that the creators of the library we're using there say is impossible, but clearly it is not :)

    I don't remember whether this was ever tracked down and resolved; I'll investigate. If not, I'll open an issue or attach this to an existing one as appropriate.

  • alongalor Member
    edited May 8

    I am experiencing a similar issue using Cromwell (v31) on SLURM. If I run the GATK4 pre-processing WDL (https://github.com/gatk-workflows/gatk4-data-processing) on ~3-4 samples simultaneously, everything works perfectly, but when I run 10+ samples simultaneously and many Cromwell jobs are submitted at the same time, many of the samples fail. Any help would be much appreciated.

  • mmah Member, Broadie

    @alongalor The problem I sometimes encounter with SLURM is that a job submission to the scheduler fails with a socket timeout when many submissions are made in a short time. Cromwell correctly detects this as an error and starts cleaning up the workflow. Subsequent errors then appear like the one I described in the original question.

    My understanding is that Cromwell 31 has a feature to limit the rate of job submissions, which should help avoid overloading the SLURM scheduler:

    The rate at which jobs are being started can now be controlled using the system.job-rate-control configuration stanza.

    I have not personally used Cromwell 31 yet, but I would suggest you try this feature; someone from the Cromwell team may be able to give you more ideas.
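
    Based on that release note, I would expect the stanza to look something like the following (untested on my side, so treat it as a sketch and check the Cromwell 31 documentation for the exact key names; the numbers are only an illustration):

    system {
      job-rate-control {
        # Allow at most this many job submissions per time period.
        jobs = 5
        per = 1 second
      }
    }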

  • alongalor Member

    @mmah thank you so much for your swift reply! That is exactly the same problem I am currently facing! So far, after reading the updated Cromwell documentation and scouring the cromwell.examples.conf file, I have updated my overrides.conf file as shown below, but unfortunately I am still experiencing the same errors (a fairly representative one is listed after the config). Any help from the Cromwell team, or any advice you might have, would be extremely appreciated!

    system { 
      # Number of seconds between workflow launches
      new-workflow-poll-rate = 60
      # Cromwell will launch up to N submitted workflows at a time, regardless of how many open workflow slots exist
      max-workflow-launch-count = 1
      io {
      # Number of times an I/O operation should be attempted before giving up and failing it.
      number-of-attempts = 10
      }
    }
    
    services {
      LoadController {
        config {
          control-frequency = 5 seconds
        }
      }
    
      HealthMonitor {
        config {
          # For any given status check, how many times to retry a failure before setting status to failed. Note this
          # is the number of retries before declaring failure, not the total number of tries which is 1 more than
          # the number of retries.
          check-failure-retry-count = 50
          # For any given status check, how long to wait between failure retries.
          check-failure-retry-interval = 30 seconds
        }
      }
    }
    
    backend {
      # Override the default backend.
      default = "SLURM"
    
      # The list of providers.
      providers {
    
        SLURM {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {        
    
            concurrent-job-limit = 1000
    
            runtime-attributes = """
            Int runtime_minutes = 36000
            Int cpus = 1
            String requested_memory_mb_per_core = "8000"
            String queue = "park"
            String account_name = "park_contrib"
            """
    
            submit = """
                sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} -t ${runtime_minutes} -p ${queue} \
                ${"-n " + cpus} ${"[email protected]"} \
                --mem-per-cpu=${requested_memory_mb_per_core} \
                --account=${account_name} \
                --wrap "/bin/bash ${script}"
            """
            kill = "scancel ${job_id}"
            check-alive = "squeue -j ${job_id}"
            job-id-regex = "Submitted batch job (\\d+).*"
          }
        }
    
      }
    }
    

    A typical error (I am attaching the corresponding output log):

    [2018-05-10 00:06:24,38] [info] DispatchedConfigAsyncJobExecutionActor [4dd7a539PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1]: java -Dsamjdk.compression_level=5 -Xms4000m -Xmx6000m -jar /n/data1/hms/dbmi/park/alon/software/picard.jar \
      MarkDuplicates \
      INPUT=/n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates/inputs/n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MergeBamAlignment/shard-0/execution/SRR475154.aligned.unsorted.bam \
      OUTPUT=SRR475154.hg38.aligned.unsorted.duplicates_marked.bam \
      METRICS_FILE=SRR475154.hg38.duplicate_metrics \
      VALIDATION_STRINGENCY=SILENT \
      OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
      ASSUME_SORT_ORDER="queryname"
      CREATE_MD5_FILE=true
    [2018-05-10 00:06:24,40] [info] DispatchedConfigAsyncJobExecutionActor [4dd7a539PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1]: executing: sbatch -J cromwell_4dd7a539_MarkDuplicates -D /n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates -o /n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates/execution/stdout -e /n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates/execution/stderr -t 36000 -p park \
    -n 4 [email protected] \
    --mem-per-cpu=7000 \
    --account=park_contrib \
    --wrap "/bin/bash /n/data1/hms/dbmi/park/DATA/Li_single_cell_BLCA/.PreProcessing/.SRR475154.bam/.sh/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/4dd7a539-717c-4fbe-b95f-7abf8db2dac9/call-MarkDuplicates/execution/script"
    [2018-05-10 00:06:26,10] [info] DispatchedConfigAsyncJobExecutionActor [4dd7a539PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1]: job id: 13954991
    [2018-05-10 00:06:26,12] [info] DispatchedConfigAsyncJobExecutionActor [4dd7a539PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1]: Status change from - to WaitingForReturnCodeFile
    Uncaught error from thread [cromwell-system-akka.dispatchers.backend-dispatcher-487]: unable to create new native thread, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[cromwell-system]
    java.lang.OutOfMemoryError: unable to create new native thread
            at java.lang.Thread.start0(Native Method)
            at java.lang.Thread.start(Thread.java:714)
            at akka.dispatch.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672)
            at akka.dispatch.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966)
            at akka.dispatch.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829)
            at akka.dispatch.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(ForkJoinExecutorConfigurator.scala:29)
            at akka.dispatch.ExecutorServiceDelegate.execute(ThreadPoolBuilder.scala:211)
            at akka.dispatch.ExecutorServiceDelegate.execute$(ThreadPoolBuilder.scala:211)
            at akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:39)
            at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:115)
            at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:55)
            at akka.actor.dungeon.Dispatch.sendMessage(Dispatch.scala:136)
            at akka.actor.dungeon.Dispatch.sendMessage$(Dispatch.scala:130)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.Cell.sendMessage(ActorCell.scala:291)
            at akka.actor.Cell.sendMessage$(ActorCell.scala:290)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.Cell.sendMessage(ActorCell.scala:291)
            at akka.actor.Cell.sendMessage$(ActorCell.scala:290)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.LocalActorRef.$bang(ActorRef.scala:400)
            at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustPoll$2(AsyncBackendJobExecutionActor.scala:77)
            at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustPoll$2$adapted(AsyncBackendJobExecutionActor.scala:76)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
            at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
            at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
            at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
            at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
            at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    [2018-05-10 00:53:45,59] [ESC[38;5;1merrorESC[0m] Uncaught error from thread [cromwell-system-akka.dispatchers.backend-dispatcher-487]: unable to create new native thread, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[cromwell-system]
    java.lang.OutOfMemoryError: unable to create new native thread
            at java.lang.Thread.start0(Native Method)
            at java.lang.Thread.start(Thread.java:714)
            at akka.dispatch.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672)
            at akka.dispatch.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966)
            at akka.dispatch.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829)
            at akka.dispatch.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(ForkJoinExecutorConfigurator.scala:29)
            at akka.dispatch.ExecutorServiceDelegate.execute(ThreadPoolBuilder.scala:211)
            at akka.dispatch.ExecutorServiceDelegate.execute$(ThreadPoolBuilder.scala:211)
            at akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:39)
            at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:115)
            at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:55)
            at akka.actor.dungeon.Dispatch.sendMessage(Dispatch.scala:136)
            at akka.actor.dungeon.Dispatch.sendMessage$(Dispatch.scala:130)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.Cell.sendMessage(ActorCell.scala:291)
            at akka.actor.Cell.sendMessage$(ActorCell.scala:290)
            at akka.actor.ActorCell.sendMessage(ActorCell.scala:370)
            at akka.actor.LocalActorRef.$bang(ActorRef.scala:400)
            at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustPoll$2(AsyncBackendJobExecutionActor.scala:77)
            at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustPoll$2$adapted(AsyncBackendJobExecutionActor.scala:76)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
            at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
            at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
            at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
            at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
            at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
            at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
            at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
            at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
            at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    

    Thanks a lot,

    Alon

  • mmah Member, Broadie

    @alongalor Your error is not the same.

    java.lang.OutOfMemoryError: unable to create new native thread

    To me, this looks like you are not allocating enough memory for the process running Cromwell.
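
    If it helps, here is a rough sketch of the kind of things I would check and adjust on the host running Cromwell (the limits, flags, and file names are placeholders, not a tested recipe):

    # Check the per-user process/thread limit on the node running Cromwell;
    # "unable to create new native thread" often means this limit was hit.
    ulimit -u

    # If Cromwell itself runs inside a SLURM allocation, request more memory for it
    # (the value is only an illustration).
    srun --mem=16G --pty bash

    # Start Cromwell with an explicit heap cap and a smaller per-thread stack size,
    # leaving more native memory available for threads (paths are placeholders).
    java -Xmx4g -Xss512k -jar cromwell-31.jar run my_workflow.wdl --inputs my_inputs.json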

  • alongalor Member
    edited May 17

    This was my first instinct as well. However, just by modifying the parameters in my overrides.conf file, keeping the input data constant, I was able to get this to work!

    Here is my updated overrides.conf file in case you're interested!

    # This line is required. It pulls in default overrides from the embedded cromwell `application.conf` needed for proper
    # performance of cromwell.
    include required("application")
    
    system { 
      # If 'true' then when Cromwell starts up, it tries to restart incomplete workflows
      #workflow-restart = true
      # Max number of retries per job that the engine will attempt in case of a retryable failure received from the backend
      max-retries = 50
      # Number of seconds between workflow launches
      new-workflow-poll-rate = 60
      max-workflow-launch-count = 1
      io {
      # Global Throttling - This is mostly useful for GCS and can be adjusted to match
      # the quota available on the GCS API
      #number-of-requests = 100000
      #per = 100 seconds
    
      # Number of times an I/O operation should be attempted before giving up and failing it.
      number-of-attempts = 10
      }
    }
    
    services {
      LoadController {
        config {
          control-frequency = 5 seconds
        }
      }
    
      HealthMonitor {
        config {
          # How long to wait between status check sweeps
          # check-refresh-time = 5 minutes
          # For any given status check, how long to wait before assuming failure
          # check-timeout = 1 minute
          # For any given status datum, the maximum time a value will be kept before reverting back to "Unknown"
          # status-ttl = 15 minutes
          # For any given status check, how many times to retry a failure before setting status to failed. Note this
          # is the number of retries before declaring failure, not the total number of tries which is 1 more than
          # the number of retries.
          check-failure-retry-count = 50
          # For any given status check, how long to wait between failure retries.
          check-failure-retry-interval = 30 seconds
        }
      }
    }
    
    backend {
      # Override the default backend.
      default = "SLURM"
    
      # The list of providers.
      providers {
    
        SLURM {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {        
    
            concurrent-job-limit = 1000
    
            runtime-attributes = """
            Int runtime_minutes = 36000
            Int cpus = 1
            String requested_memory_mb_per_core = "8000"
            String queue = "park"
            String account_name = "park_contrib"
            """
    
            submit = """
                sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} -t ${runtime_minutes} -p ${queue} \
                ${"-n " + cpus} ${"[email protected]"} \
                --mem-per-cpu=${requested_memory_mb_per_core} \
                --account=${account_name} \
                --wrap "/bin/bash ${script}"
            """
            kill = "scancel ${job_id}"
            check-alive = "squeue -j ${job_id}"
            job-id-regex = "Submitted batch job (\\d+).*"
          }
        }
    
      }
    }
    
    
  • alongalor Member

    Hi Cromwell Team!

    Unfortunately, I am still running into countless errors when running the standard joint-genotyping workflow published at https://github.com/gatk-workflows/gatk4-germline-snps-indels on our local SLURM cluster. I often find that if I run the exact same workflow 5-10 times, it may work 1-2 times and fail the rest, based on what appear to be random issues, perhaps pertaining to the interaction of Cromwell with our cluster. I plan to set up a MySQL database and write a script that uses the results saved in that database to automatically restart failed workflows. I am attaching the error log and wondering if you have any other suggestions to remedy this issue.

    Below is one such error I have encountered several times now:

    [2018-05-19 03:19:44,56] [info] DispatchedConfigAsyncJobExecutionActor [c5fed572JointGenotyping.ImportGVCFs:3024:1]: Status change from WaitingForReturnCodeFile to Done
    [2018-05-19 03:19:56,77] [info] DispatchedConfigAsyncJobExecutionActor [c5fed572JointGenotyping.ImportGVCFs:3876:1]: Status change from WaitingForReturnCodeFile to Done
    [2018-05-19 03:20:03,45] [info] DispatchedConfigAsyncJobExecutionActor [c5fed572JointGenotyping.ImportGVCFs:79:1]: Status change from WaitingForReturnCodeFile to Done
    [2018-05-19 03:20:12,60] [info] DispatchedConfigAsyncJobExecutionActor [c5fed572JointGenotyping.ImportGVCFs:579:1]: Status change from WaitingForReturnCodeFile to Done
    [2018-05-19 03:20:14,52] [info] DispatchedConfigAsyncJobExecutionActor [c5fed572JointGenotyping.ImportGVCFs:1041:1]: Status change from WaitingForReturnCodeFile to Done
    [2018-05-19 03:20:34,02] [info] DispatchedConfigAsyncJobExecutionActor [c5fed572JointGenotyping.ImportGVCFs:919:1]: Status change from WaitingForReturnCodeFile to Done
    [2018-05-19 03:20:34,64] [error] WorkflowManagerActor Workflow c5fed572-4c03-4896-b777-5f752adb55e4 failed (during ExecutingWorkflowState): Job JointGenotyping.ImportGVCFs:1426:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
    Check the content of stderr for potential additional information: /n/data1/hms/dbmi/park/DATA/Gawad_single_cell_ALL/.JointDiscovery/.sh/cromwell-executions/JointGenotyping/c5fed572-4c03-4896-b777-5f752adb55e4/call-ImportGVCFs/shard-1426/execution/stderr
    Unable to start job. Check the stderr file for possible errors: /n/data1/hms/dbmi/park/DATA/Gawad_single_cell_ALL/.JointDiscovery/.sh/cromwell-executions/JointGenotyping/c5fed572-4c03-4896-b777-5f752adb55e4/call-ImportGVCFs/shard-3162/execution/stderr.submit
    java.lang.RuntimeException: Unable to start job. Check the stderr file for possible errors: /n/data1/hms/dbmi/park/DATA/Gawad_single_cell_ALL/.JointDiscovery/.sh/cromwell-executions/JointGenotyping/c5fed572-4c03-4896-b777-5f752adb55e4/call-ImportGVCFs/shard-3162/execution/stderr.submit
        at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.$anonfun$execute$2(SharedFileSystemAsyncJobExecutionActor.scala:130)
        at scala.util.Either.fold(Either.scala:188)
        at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute(SharedFileSystemAsyncJobExecutionActor.scala:125)
        at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute$(SharedFileSystemAsyncJobExecutionActor.scala:121)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.execute(ConfigAsyncJobExecutionActor.scala:206)
        at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$executeAsync$1(StandardAsyncExecutionActor.scala:451)
        at scala.util.Try$.apply(Try.scala:209)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync(StandardAsyncExecutionActor.scala:451)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync$(StandardAsyncExecutionActor.scala:451)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeAsync(ConfigAsyncJobExecutionActor.scala:206)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:744)
        at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:736)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeOrRecover(ConfigAsyncJobExecutionActor.scala:206)
        at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:65)
        at cromwell.core.retry.Retry$.withRetry(Retry.scala:37)
        at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:61)
        at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:65)
        at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:88)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
        at akka.actor.Actor.aroundReceive(Actor.scala:514)
        at akka.actor.Actor.aroundReceive$(Actor.scala:512)
        at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.aroundReceive(ConfigAsyncJobExecutionActor.scala:206)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
        at akka.actor.ActorCell.invoke(ActorCell.scala:496)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    
    Job JointGenotyping.ImportGVCFs:2553:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
    Check the content of stderr for potential additional information: /n/data1/hms/dbmi/park/DATA/Gawad_single_cell_ALL/.JointDiscovery/.sh/cromwell-executions/JointGenotyping/c5fed572-4c03-4896-b777-5f752adb55e4/call-ImportGVCFs/shard-2553/execution/stderr
    Job JointGenotyping.ImportGVCFs:556:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
    Check the content of stderr for potential additional information: /n/data1/hms/dbmi/park/DATA/Gawad_single_cell_ALL/.JointDiscovery/.sh/cromwell-executions/JointGenotyping/c5fed572-4c03-4896-b777-5f752adb55e4/call-ImportGVCFs/shard-556/execution/stderr
    Job JointGenotyping.ImportGVCFs:3201:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
    Check the content of stderr for potential additional information: /n/data1/hms/dbmi/park/DATA/Gawad_single_cell_ALL/.JointDiscovery/.sh/cromwell-executions/JointGenotyping/c5fed572-4c03-4896-b777-5f752adb55e4/call-ImportGVCFs/shard-3201/execution/stderr
    Job JointGenotyping.ImportGVCFs:3177:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
    Check the content of stderr for potential additional information: /n/data1/hms/dbmi/park/DATA/Gawad_single_cell_ALL/.JointDiscovery/.sh/cromwell-executions/JointGenotyping/c5fed572-4c03-4896-b777-5f752adb55e4/call-ImportGVCFs/shard-3177/execution/stderr
    [2018-05-19 03:20:34,65] [info] WorkflowManagerActor WorkflowActor-c5fed572-4c03-4896-b777-5f752adb55e4 is in a terminal state: WorkflowFailedState
    [2018-05-19 03:21:10,43] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
    [2018-05-19 03:21:14,56] [info] Message [cromwell.core.actor.StreamActorHelper$StreamFailed] without sender to Actor[akka://cromwell-system/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
    [2018-05-19 03:21:14,56] [info] Message [cromwell.core.actor.StreamActorHelper$StreamFailed] without sender to Actor[akka://cromwell-system/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
    [2018-05-19 03:21:14,56] [info] Message [cromwell.core.actor.StreamActorHelper$StreamFailed] without sender to Actor[akka://cromwell-system/deadLetters] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
    [2018-05-19 03:21:14,60] [error] Outgoing request stream error
    akka.stream.AbruptTerminationException: Processor actor [Actor[akka://cromwell-system/user/StreamSupervisor-1/flow-859-0-mergePreferred#-127827058]] terminated abruptly
    [2018-05-19 03:21:14,60] [info] Message [akka.actor.FSM$Transition] from Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor/ServiceRegistryActor/MetadataService/WriteMetadataActor#2118035117] to Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor#-647487697] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
    [2018-05-19 03:21:14,60] [error] Outgoing request stream error
    akka.stream.AbruptTerminationException: Processor actor [Actor[akka://cromwell-system/user/StreamSupervisor-1/flow-863-0-mergePreferred#-1909984975]] terminated abruptly
    Workflow c5fed572-4c03-4896-b777-5f752adb55e4 transitioned to state Failed
    [2018-05-19 03:21:14,75] [info] Automatic shutdown of the async connection
    [2018-05-19 03:21:14,76] [info] Gracefully shutdown sentry threads.
    [2018-05-19 03:21:14,77] [info] Shutdown finished.
    

    Below is the content of /n/data1/hms/dbmi/park/DATA/Gawad_single_cell_ALL/.JointDiscovery/.sh/cromwell-executions/JointGenotyping/c5fed572-4c03-4896-b777-5f752adb55e4/call-ImportGVCFs/shard-3162/execution/stderr.submit:

    sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
    

    Any help would be much appreciated.

    Very best,

    Alon

  • mmah Member, Broadie

    @alongalor In the future, I suggest that you start a new question as your queries do not appear to be related to the initial problem.

    Again, I suggest you configure Cromwell's job submission rate. On Harvard Medical School's O2 cluster, which I suspect you are also using, this has resolved the timeout problem for me; I have not seen a timeout since upgrading and adding this configuration.

    The job-rate-control stanza in the system section of the configuration file is very simple. My largest scatter is ~300 jobs, so taking about a minute to submit all of them is still more than acceptable performance.

    system {
      job-rate-control {
        jobs = 5
        per = 1 second
      }
    }
    
  • alongalor Member
    edited July 16

    Hi @mmah,

    Thanks for your helpful response. I was not aware of that particular parameter - thanks for sharing!

    Relatedly, would you be able to share your entire configuration file? I am pasting mine below, as it stands before adding in your suggested parameters, in case you're interested:

    # This line is required. It pulls in default overrides from the embedded cromwell `application.conf` needed for proper
    # performance of cromwell.
    include required("application")
    
    system { 
      # If 'true' then when Cromwell starts up, it tries to restart incomplete workflows
      #workflow-restart = true
      # Max number of retries per job that the engine will attempt in case of a retryable failure received from the backend
      max-retries = 50
      # Number of seconds between workflow launches
      new-workflow-poll-rate = 60
      max-workflow-launch-count = 1
      io {
      # Global Throttling - This is mostly useful for GCS and can be adjusted to match
      # the quota available on the GCS API
      #number-of-requests = 100000
      #per = 100 seconds
    
      # Number of times an I/O operation should be attempted before giving up and failing it.
      number-of-attempts = 10
      }
    }
    
    services {
      LoadController {
        config {
          control-frequency = 5 seconds
        }
      }
    
      HealthMonitor {
        config {
          # How long to wait between status check sweeps
          # check-refresh-time = 5 minutes
          # For any given status check, how long to wait before assuming failure
          # check-timeout = 1 minute
          # For any given status datum, the maximum time a value will be kept before reverting back to "Unknown"
          # status-ttl = 15 minutes
          # For any given status check, how many times to retry a failure before setting status to failed. Note this
          # is the number of retries before declaring failure, not the total number of tries which is 1 more than
          # the number of retries.
          check-failure-retry-count = 50
          # For any given status check, how long to wait between failure retries.
          check-failure-retry-interval = 30 seconds
        }
      }
    }
    
    backend {
      # Override the default backend.
      default = "SLURM"
    
      # The list of providers.
      providers {
    
        SLURM {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {        
    
            concurrent-job-limit = 300

            runtime-attributes = """
            Int runtime_minutes = 36000
            Int cpus = 1
            String requested_memory_mb_per_core = "8000"
            String queue = "priopark"
            String account_name = "park_contrib"
            """
    
            submit = """
                sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} -t ${runtime_minutes} -p ${queue} \
                ${"-n " + cpus} ${"[email protected]"} \
                --mem-per-cpu=${requested_memory_mb_per_core} \
                --account=${account_name} \
                --wrap "/bin/bash ${script}"
            """
            kill = "scancel ${job_id}"
            check-alive = "squeue -j ${job_id}"
            job-id-regex = "Submitted batch job (\\d+).*"
          }
        }
    
      }
    }
    
  • alongalor Member
    edited July 17

    @mmah do you ever run multiple workflows at a time? I often experience these types of errors when running tens of workflows (e.g. running HaplotypeCaller over several BAMs, where each workflow takes an individual BAM as input). Adding in the parameters you suggested does not remedy the issue either.

  • mmah Member, Broadie

    @alongalor I do not run more than two concurrent workflows, and these do not scatter at exactly the same time.

    You basically have to limit your peak rate of requests to the SLURM scheduler. With tens of workflows and my parameters, you may be trying to start hundreds of jobs per second. Cut this back to a conservative number, for example as sketched below. Assuming your jobs last more than a few seconds, the limiting factor is ultimately not the scheduler but the availability of cluster computing resources.
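
    For example, something like this (the numbers are only an illustration; note that if each workflow runs in its own standalone Cromwell process, the limits multiply across those processes):

    system {
      job-rate-control {
        # One submission per second per Cromwell instance; with ~20 concurrent
        # instances this is still ~20 submissions per second hitting SLURM.
        jobs = 1
        per = 1 second
      }
    }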

  • JohnyWalter Member
    edited August 20

    LSF cannot fix the unstable-network socket issue itself, but there is a workaround: based on the exit code returned when a job submission fails, LSF can retry the submission.

    Before submitting jobs, set an environment variable:
    LSB_BSUB_ERR_RETRY="RETRY_CNT[5] ERR_TYPE[29]"

    The bsub command will then retry a submission that fails with exit code 29; the maximum number of retries is 5, and the default retry interval is 3 seconds.

    This way, LSF can handle the submission failures (such as user verification failures) caused by an unstable network.
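
    For example, in the shell (or wrapper script) that performs the submissions, something like this could be set (a sketch; the job name, output files, and script path are placeholders):

    # Retry bsub up to 5 times when a submission fails with exit code 29,
    # using the values described above.
    export LSB_BSUB_ERR_RETRY="RETRY_CNT[5] ERR_TYPE[29]"

    # Submissions made from this environment are then retried automatically.
    bsub -J my_job -o out.txt -e err.txt "/bin/bash script.sh"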
