Queue + Torque: keep_completed required?

smowton

I'm running Queue 3.1 with Torque 4.2.8. With Torque's default setup, Queue reported spurious failures for tasks that ran on remote nodes, but not for tasks that ran on the local node (my "cluster" has the pbs_server node doing compute as well).

The jobs in question would report "FunctionEdge - Error:" but then print an apparently successful GATK log file.

I found that after using qmgr to run "set queue queue_name keep_completed = 300", the jobs complete successfully. However, since jobs running on the local node always succeeded even without keep_completed set, I suspect I'm doing something obviously wrong, either with Torque or with Queue, as I'm pretty new to both systems.
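For reference, the change described above can be sketched roughly as follows. This is only an illustration: queue_name is a placeholder for the real queue name, the variable names are made up for the example, and it assumes qmgr is run on the pbs_server host by a user with manager rights. The keep_completed value is in seconds.

```shell
#!/bin/sh
# Placeholder values: replace queue_name with the actual Torque queue.
QUEUE="queue_name"
KEEP_SECONDS=300

# qmgr directive that keeps finished jobs visible to the server
# for KEEP_SECONDS seconds after they exit.
CMD="set queue ${QUEUE} keep_completed = ${KEEP_SECONDS}"

if command -v qmgr >/dev/null 2>&1; then
    # Apply the setting, then list the queue attributes to confirm it.
    qmgr -c "$CMD"
    qmgr -c "list queue ${QUEUE}"
else
    # Not on a Torque host: just show what would have been run.
    echo "qmgr not found; would run: qmgr -c \"$CMD\""
fi
```

With the setting in place, a finished job should stay visible in state C (for example via qstat) until the interval expires, which is presumably what lets the exit status still be queried after a remote job ends.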

In particular, I'm unsure which paths are expected to live in a node-private filesystem and which in a shared filesystem. I currently run my tasks from a shared scratch directory (yielding -Djava.io.tmpdir=/shared-scratch/.queue/tmp); is this correct? I also note that /home/myuser/.drmaa/jobid.started and .exitcode files are being created for some reason, even though pbs_mom is running as root, which seems wrong.

Has anyone seen similar behaviour that could shed some light on the situation?
