Unscheduled downtime of the “Fritz” cluster from October 16
On the night of October 15 we encountered severe issues with Slurm resulting in many jobs crashing.
Batch operation is currently halted. We are investigating the incident.
As of now, we do not have an estimate of when operation of Fritz can be resumed.
We will keep this post updated.
Update 11:45: By now it is clear that Slurm has completely destroyed its job database through corruption, although we do not know why. All signs point to “high quality software”. All running jobs were terminated at 3:11 this morning, and all information about jobs that were still queued was lost and cannot be recovered safely. The problems started before that, however: jobs may have experienced weird errors since 18:45 yesterday.
You will have to requeue all jobs at some point. For now, please refrain from queueing new jobs – all jobs you queue now will have to be deleted again when we clean up this mess.
Update 12:30: You can now queue jobs again.
Update 12:45: Batch processing is slowly being resumed.
Update 14:15: Batch processing has been resumed. You can use the cluster normally again. Any jobs from before 12:30 today are gone: they either aborted or did not run at all. You will need to requeue them.