WDL fails with NoSuchFileException, but the file exists

I've encountered this error multiple times when submitting large jobs on LSF (>1000 scatter pieces). WDL seems to fail with a file-not-found error, always in regard to the stderr file, but when I look the file up manually it is always there, and the specific task also finished with rc=0. The main cromwell process, however, had already failed with return code 1 due to the file-not-found error.

I've tried continue-on-return-code; it didn't work. I have retry-on-IO-failure set to 200. At the moment I'm just forcing LSF to requeue the main cromwell job whenever it exits with a nonzero return code, relying on MySQL call caching to ensure that I'm not creating too much junk, but I was wondering if there's something I've missed.
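The requeue-on-failure idea can also be expressed as a plain shell wrapper around the top-level cromwell invocation, with call caching making the reruns cheap. A minimal sketch (illustrative only, not LSF-specific; `retry_cmd` is a made-up name):

```shell
#!/bin/sh
# retry_cmd: rerun a command until it exits 0, up to a fixed number of
# attempts; returns 1 if every attempt failed.
retry_cmd() {
  attempts=$1
  shift
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1              # exhausted all attempts
    fi
    i=$((i + 1))
  done
  return 0
}

# Hypothetical usage: keep resubmitting the top-level Cromwell run;
# already-completed shards are skipped via call caching.
# retry_cmd 10 java -jar cromwell.jar run workflow.wdl
```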

The error message is below. Any ideas would be greatly appreciated.

[ERROR] [06/20/2017 18:48:19.103] [cromwell-system-akka.dispatchers.engine-dispatcher-63] [akka://cromwell-system/user/SingleWorkflowRunnerActor/WorkflowManagerActor] WorkflowManagerActor Workflow f63b2032-9566-4ca1-9fc3-5982fb72584c failed (during ExecutingWorkflowState): java.nio.file.NoSuchFileException: /data/talkowski/hw878/PennCNV/BAFgeneration/BAF/BAFfilegenerate/File-generate/cromwell-executions/Filegen/f63b2032-9566-4ca1-9fc3-5982fb72584c/call-work/work/ad040180-1ccb-485d-9a18-559a8cb919c4/call-BAFgen/shard-6068/execution/stderr
cromwell.core.CromwellFatalException: java.nio.file.NoSuchFileException: /data/talkowski/hw878/PennCNV/BAFgeneration/BAF/BAFfilegenerate/File-generate/cromwell-executions/Filegen/f63b2032-9566-4ca1-9fc3-5982fb72584c/call-work/work/ad040180-1ccb-485d-9a18-559a8cb919c4/call-BAFgen/shard-6068/execution/stderr
  at cromwell.core.CromwellFatalException$.apply(core.scala:17)
  at cromwell.core.retry.Retry$$anonfun$withRetry$1.applyOrElse(Retry.scala:37)
  at cromwell.core.retry.Retry$$anonfun$withRetry$1.applyOrElse(Retry.scala:36)
  at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:346)
  at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:345)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
  at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
  at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
  at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
  at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
  at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
  at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
  at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.nio.file.NoSuchFileException: /data/talkowski/hw878/PennCNV/BAFgeneration/BAF/BAFfilegenerate/File-generate/cromwell-executions/Filegen/f63b2032-9566-4ca1-9fc3-5982fb72584c/call-work/work/ad040180-1ccb-485d-9a18-559a8cb919c4/call-BAFgen/shard-6068/execution/stderr
  at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
  at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
  at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
  at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
  at java.nio.file.Files.readAttributes(Files.java:1737)
  at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
  at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
  at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322)
  at java.nio.file.FileTreeIterator.<init>(FileTreeIterator.java:72)
  at java.nio.file.Files.walk(Files.java:3574)
  at better.files.File.walk(File.scala:467)
  at better.files.File.size(File.scala:501)
  at cromwell.core.path.BetterFileMethods$class.size(BetterFileMethods.scala:323)
  at cromwell.core.path.DefaultPath.size(DefaultPathBuilder.scala:53)
  at cromwell.engine.io.nio.NioFlow$$anonfun$size$1.apply$mcJ$sp(NioFlow.scala:67)
  at cromwell.engine.io.nio.NioFlow$$anonfun$size$1.apply(NioFlow.scala:67)
  at cromwell.engine.io.nio.NioFlow$$anonfun$size$1.apply(NioFlow.scala:67)
  at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
  at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
  ... 6 more

Issue · Github
by Geraldine_VdAuwera

Issue Number: 2193
State: closed
Closed By: vdauwera

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    Hi @awacs, sorry for the delay -- we need to have the engineering team take a look, as this is a bit outside our support team's comfort zone. In the meantime, can you give a bit more detail about what version of Cromwell you're using, and anything more you can tell us about the jobs that are failing with this error? I.e., does it look random, or is it reproducible for a particular job or submission?

  • awacs · Member
    edited June 2017

    I'm using version 26, and the error doesn't seem to be reliably reproducible. In fact, the way I'm bypassing it is just to run the whole workflow again when the previous run fails. It happens mostly when I have a scatter-gather construct with more than 100 shards.

  • mcovarr · Cambridge, MA · Member, Broadie, Dev

    @awacs

    Is it possible to post the WDL for the task that's giving you this problem? Also, what version of Cromwell are you using?

    My suspicion is a race condition between the stderr file becoming visible in the filesystem and the Cromwell engine trying to inspect it for nonzero size. Hopefully Cromwell only checks for nonzero stderr size if failOnStderr is set to true; is that the case for this task?

    It may be possible, as an unsavory short-term workaround, to set the script-epilogue value in your LSF config to add in a sleep:

    backend {
      default = "LSF"
      providers {
        LSF {
          actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
          config {

            # Limits the number of concurrent jobs
            #concurrent-job-limit = 5

            run-in-background = true

            # `script-epilogue` configures a shell command to run after the execution of every command block.
            #
            # If this value is not set explicitly, the default value is `sync`, equivalent to:
            # script-epilogue = "sync"
            #
            # To turn off the default `sync` behavior set this value to an empty string:
            # script-epilogue = ""

            # `sleep` hack to give the FS time to register file creations
            script-epilogue = "sync; sleep 5"
          }
        }
      }
    }
    
    
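The `sleep 5` in that epilogue is a blind delay; a slightly more targeted variant would poll until the file is actually visible before the job script exits. This is only a sketch (`wait_for_file` is a made-up POSIX-shell helper, not a Cromwell feature, and wiring it into `script-epilogue` would require a wrapper script available on the execution hosts):

```shell
#!/bin/sh
# wait_for_file: poll until a path is visible on the (possibly laggy)
# shared filesystem, up to a timeout in seconds; returns 1 on timeout.
wait_for_file() {
  f=$1
  timeout=${2:-30}
  t=0
  while [ ! -e "$f" ]; do
    if [ "$t" -ge "$timeout" ]; then
      return 1              # still not visible; caller decides what to do
    fi
    sleep 1
    t=$((t + 1))
  done
  return 0
}
```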
  • awacs · Member

    @mcovarr said:
    Hopefully Cromwell only checks for nonzero stderr size if failOnStderr is set to true; is that the case for this task?

    I do not have failOnStderr set to true.

  • mcovarr · Cambridge, MA · Member, Broadie, Dev

    OK I can see that Cromwell is querying the stderr size whether it needs to or not; I'm testing out some changes to fix that behavior. In the meantime you might want to try adjusting script-epilogue to see if that helps.
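The intended behavior after such a fix can be illustrated with a small shell sketch (the names here are made up; Cromwell itself is Scala, this only shows the guard logic):

```shell
#!/bin/sh
# check_stderr: only treat a task as failed on nonzero stderr size when
# failOnStderr is enabled; otherwise never inspect the file's size at all.
check_stderr() {
  fail_on_stderr=$1
  stderr_file=$2
  if [ "$fail_on_stderr" = "true" ] && [ -s "$stderr_file" ]; then
    return 1                # nonempty stderr while failOnStderr is on
  fi
  return 0
}
```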

  • mcovarr · Cambridge, MA · Member, Broadie, Dev
    edited June 2017

    PR for the forthcoming Cromwell 28 release to address the needless and potentially error-prone stderr access.

  • awacs · Member

    @mcovarr said:
    OK I can see that Cromwell is querying the stderr size whether it needs to or not; I'm testing out some changes to fix that behavior. In the meantime you might want to try adjusting script-epilogue to see if that helps.

    Unfortunately, this does not seem to help.

  • awacs · Member


    Does the "workflow_failure_mode" option apply in this case?

  • danb · Member, Broadie

    workflow_failure_mode set to ContinueWhilePossible will not resolve the issue. It will proceed with other tasks despite failures, so you will proceed farther, but you will still need to restart the job to re-run the failed tasks.

    The PR to check stderr only if necessary will help but there seems to be a persistent race condition. I've created this bug for investigation purposes.
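For reference, workflow_failure_mode is supplied in the workflow options JSON file passed to Cromwell at submission time:

```json
{
  "workflow_failure_mode": "ContinueWhilePossible"
}
```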

  • awacs · Member

    @danb said:
    workflow_failure_mode set to ContinueWhilePossible will not resolve the issue.

    I asked this question in another thread, but I'm wondering: if I set workflow_failure_mode to ContinueWhilePossible and the main cromwell process fails because of the NoSuchFileException, will the scatter processes that the main cromwell process had already spawned before failing be recorded in the MySQL cache when they complete successfully? If so, I can requeue (resubmit) the main cromwell process in case of failure.

    On a side note, how do I check whether a job has been cached or not?

  • danb · Member, Broadie

    Yes, the other tasks will be recorded in the cache upon successful completion.

    We are adding a cache-debug endpoint to the next release, see more here.

  • awacs · Member
    Accepted Answer

    Note: this problem has been fixed in Cromwell version 28.
