Scheduled downtime of all systems at NHR@FAU on February 11 and 12

2026-02-04

There will be a scheduled downtime of all the HPC systems of NHR@FAU, starting on

Wednesday, February 11, at 8:30,
and expected to last until
the evening of Thursday, February 12.

The reason for the downtime is maintenance on various systems, including but not limited to the central fileservers.

Jobs that would collide with the downtime will automatically be postponed until after it. Frontends and fileservers will also experience repeated interruptions, so you should not expect to be able to log in or access your files at all during the downtime.
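To illustrate what "collide with the downtime" means for a job's requested walltime, here is a minimal sketch. It is not how the batch system implements this; the function names are hypothetical, and the end time of 20:00 is only a placeholder, since the announcement says no more than "the evening of Thursday, February 12".

```python
from datetime import datetime, timedelta

# Start of the downtime as announced above.
DOWNTIME_START = datetime(2026, 2, 11, 8, 30)
# Assumption: "the evening of Thursday, February 12" is not an exact time;
# 20:00 is a placeholder, not an official end of the downtime.
DOWNTIME_END = datetime(2026, 2, 12, 20, 0)


def collides_with_downtime(start: datetime, walltime: timedelta) -> bool:
    """Return True if a job running in [start, start + walltime) overlaps the downtime."""
    end = start + walltime
    return start < DOWNTIME_END and end > DOWNTIME_START


def earliest_start(submit: datetime, walltime: timedelta) -> datetime:
    """Start right away if the job fits before the downtime, otherwise after it."""
    if collides_with_downtime(submit, walltime):
        return DOWNTIME_END
    return submit
```

Under these assumptions, a job submitted at noon on February 10 with a 24-hour walltime would be held until after the downtime, while the same job with a 20-hour walltime could still run beforehand, provided resources are free. Requesting a shorter walltime is therefore the practical way to get a job through before the downtime starts.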

We will keep this post updated during the downtime.

Update 2026-02-12 12:30: We’re done with maintenance on central systems (e.g. fileservers), and will now start to bring the clusters back online (one after another).

Update 2026-02-12 14:00: Batch processing has been resumed on Woody.

Update 2026-02-12 15:00: Batch processing has been resumed on TinyFAT and TinyGPU.

Update 2026-02-12 16:45: Batch processing has been resumed on Alex.

Update 2026-02-12 17:20: Batch processing has been resumed on Fritz.

Update 2026-02-12 17:30: Unfortunately, batch processing on Helma will not be resumed today, mainly thanks to NVIDIA DOCA being a great piece of software with outstanding compatibility and stability that almost never breaks and almost never requires building workarounds for two days. We expect Helma to be back around noon tomorrow.

Update 2026-02-13 12:00: Batch processing has been resumed on Helma.

Update 2026-02-13 12:00: We’re done with this maintenance. (Everything that follows is unplanned corrective work after the update.)

Update 2026-02-16 08:30: The new Lustre client on Helma proved to be extremely unstable, leading to random errors when trying to write files on Helma’s /hnvme. We have now downgraded the Lustre client and so far can no longer reproduce the problems.

Update 2026-02-16 08:30: The fileservers for $HOME and $VAULT (/home/hpc/ and /home/vault/) have also been unstable since the update, leading to random hangs and follow-on errors such as temporary failures to log in via SSH. We’re still investigating these issues but do NOT have a solution yet, so this is ongoing.

Update 2026-02-16 12:30: There are now also intermittent problems with /lustre on Fritz (also available from Alex).

Update 2026-02-16 15:00: The problems with /hnvme on Helma are now definitely resolved (we can now easily reproduce the problems, and we know which Lustre feature needs to be disabled in the new client to fix them instantly).

Update 2026-02-17 08:00: Since the downtime, Helma had not been sending emails to users who requested notifications about their jobs; this has been fixed.

Update 2026-02-17 09:00: The fileservers for $HOME and $VAULT (/home/hpc/ and /home/vault/) should now be stable again after a kernel downgrade.

Update 2026-02-17 09:00: /lustre on Fritz (also available from Alex) has been stable since we increased a timeout setting yesterday evening, so we’re cautiously optimistic that this problem is now fixed as well.