Outage of HPC services due to file system issues [SOLVED]

2024-02-10

On Friday, February 9th @ 19:30, one of the network file servers (NFS) stopped working. As a consequence, many login and compute nodes became unresponsive because a major file system ($WORK for NHR projects) could not be accessed.

The NFS server (atuin) has been rebooted, and operation appears to have been stable again since Saturday, February 10th @ 08:10. The last hanging mount points were fixed by 10 o’clock on Saturday morning.

Jobs that were already running or were started between Friday, February 9th @ 19:30 and Saturday, February 10th @ 08:10 may be impacted. The runtime of some jobs (especially on Alex) has therefore been extended.

On Sunday, February 11th @ 16:00, another network file server (NFS) stopped working. As a consequence, many login and compute nodes became unresponsive because a major file system ($WORK for FAU/RRZE Tier3 users) could not be accessed.

The NFS server (wnfs1) is currently rebooting. We expect it to be back in operation by 19:00.

Jobs that were already running or were started on Sunday, February 11th between 16:00 and 19:30 may be impacted.