Unable to determine status of job ID error

Hi,
I occasionally get the following type of error when using the Genome STRiP CNVDiscovery pipeline on a cluster running SGE. It runs successfully for long periods of time and then hits this type of error:
WARN 18:12:43,778 DrmaaJobRunner - Unable to determine status of job id 128098
1322 org.ggf.drmaa.DrmCommunicationException: failed receiving gdi request response for mid=63604 (can't send response for this message id - protocol error).

Usually I can just restart the pipeline and it will run to completion, but I don't know whether I need to restart the whole stage. I sometimes have a hard time telling whether there are actual problems I need to address. Any tips on troubleshooting this correctly?

Thanks!

Answers

  • I've noticed that stage1 seq_11 has a VCF file with zero size (for unknown reasons). How can I re-run this stage? Do I need to re-run the whole pipeline, or can I delete sentinel files to go back to that stage?

  • I have the same issue here. Did you happen to solve the problem?

  • bhandsaker Member, Broadie, Moderator

    You can delete the sentinel files if you want to force the stage to rerun. Rerunning a stage will rerun the Queue pipelines for that stage, which will only redo work that needs to be redone. But there shouldn't be a sentinel file unless the stage completed successfully.
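
    Before deciding which stage to force, it can help to list any zero-size outputs like the empty VCF mentioned earlier in this thread. The following is only a minimal Python sketch: the function name `find_empty_files` and the default `.vcf` suffix are assumptions for illustration and are not part of Genome STRiP.

```python
import os


def find_empty_files(root, suffix=".vcf"):
    """Return sorted paths under `root` that end with `suffix` and are 0 bytes.

    Hypothetical helper for illustration; not part of Genome STRiP.
    """
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Keep only files with the requested suffix and no content.
            if name.endswith(suffix) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

    Calling `find_empty_files("my_run_directory")` (a placeholder path) returns the list of empty files, which points to the stage whose outputs need to be regenerated.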

    The original problem was likely a transient failure. My first line of defense is always to retry. If the exact same job fails on retry, then I will dig into the (several layers of) log files to see whether there is some reproducible problem.
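
    The retry-first approach described above could be automated roughly as follows. This is a minimal Python sketch, assuming the pipeline is launched by a command that exits with a non-zero status when it fails; the function name `run_with_retries` and its parameters are hypothetical and not part of Genome STRiP or Queue.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=3, delay_seconds=0):
    """Run `cmd` until it exits with status 0 or attempts run out.

    Returns (succeeded, attempts_used). Assumes a failed run is signalled
    by a non-zero exit status, which may not hold for every launcher.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True, attempt
        if attempt < max_attempts:
            # Wait before retrying so a transient cluster problem can clear.
            time.sleep(delay_seconds)
    return False, max_attempts
```

    For example, `run_with_retries(["bash", "run_discovery.sh"], max_attempts=3, delay_seconds=60)` would rerun a launcher script (the script name is a placeholder) up to three times. If the result is still `(False, 3)`, the failure is probably not transient and the log files are the next place to look.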

  • hjzhou Member

    Actually I have a slightly different error. I will start another thread.
