Transient error issues
Over the last week I have been running many workflows (sample sets of 1000), and I keep running into the following transient errors:
1) All tasks "succeed" but the workflow fails (screenshot 1; it looks like FC is unable to find the json_auth file)
2) The JES log reports that all of the output files have been transferred to the bucket, and they are present in the bucket, but the task never changes from "running" to "succeeded". This leaves workflows stalled for >24 hours
3) The job store connection is unexpectedly lost, which kills the job that was running at the time (screenshot 2)
Over the last week, the majority of the sample sets I have run encountered one of the above issues, including runs on smaller sample/pair sets. So far my workaround has been to simply restart the failed jobs and abort the hung ones, but it is frustrating to have to keep checking on my tasks to catch these failures. Is there any way to avoid these errors, or is work being done to prevent them from occurring? Thank you.