
Transient error issues

mleventhal (Cambridge, MA) Member, Broadie ✭✭

Hello,

Over the last week I have been running many workflows (sample sets of 1000), and I keep running into the following transient errors:

1) All tasks "succeed" but the workflow fails (screenshot 1; it looks like FC is unable to find the json_auth file)

2) The JES log notes that all of the output files have been transferred to the bucket, and they are present in the bucket, but the task never changes from "running" to "succeeded", leaving workflows stalled for more than 24 hours

3) The job store connection is unexpectedly lost, which kills the job that was running at the time (screenshot 2)

In the last week, the majority of the sample sets I have run hit one of the above issues, including smaller sample/pair sets. So far the fix has been to simply restart the failed jobs and abort the hung ones, but it is frustrating to have to babysit my tasks to catch these failures. Is there any way to avoid these errors, or is work being done to prevent them? Thank you.

Best,
Matt Leventhal

Answers

  • SChaluvadi Member, Broadie, Moderator admin

    @mleventhal Sorry to hear that you are seeing so many errors! I will take a look/loop in the appropriate team members and get back to you with more information/solutions.

  • AdelaideR Member admin

    @mleventhal

    Is there any possibility that you could share your workspace with me so I may look at the error logs? Sometimes they contain more information.

  • mleventhal (Cambridge, MA) Member, Broadie ✭✭

    @AdelaideR Just shared the workspace. Let me know if you find anything of particular interest.

  • jrouhana Member

    I'm posting to share that I've been having difficulties similar to mleventhal's for the past week. I'm running into the same situation where I do not see errors and the failure message just says that auth.json was not found. Re-running the job usually succeeds. Aside from that, I had one job that stayed in 'running' for far longer than it should have, even after the tasks had succeeded. I tried to abort it yesterday, and it still says 'Aborting'. I don't think the problem is isolated to @mleventhal's workspace.

  • AdelaideR Member admin

    @mleventhal and @jrouhana I have forwarded your concerns to the workbench team.

    In the meantime, I have a workaround that may save some time.

    If you go into the Terra platform and look at your job history, you can click a button to rerun only the failed samples. I see that in one of your 1000-job submissions, only 3 failed near the end, so you should only have to rerun those three.

    It may take a little time to hear back from the workbench team, but I will keep you posted.
