Systems down due to power glitch – SOLVED
Today (November 26) at around 4:30 p.m., a short glitch on the city power lines led to most HPC systems going down: Fritz, Alex, Woody, Meggie, TinyGPU, TinyFat, the test cluster, and parts of Helma. Frontends are still up, but most jobs were killed. Admins are working to get the systems up and running again.
Updates will be posted here.
Wednesday 2025-11-26
- 18:00 – memoryhog is back
- 18:05 – local sessions on Tier3 Jupyterhub are back
- 18:05 – Fritz & Helma won’t be back today due to failure of the building’s central cooling infrastructure; ZUV/G is working on it. NHR Jupyterhub is down as it’s running on a Fritz node
- 18:20 – w14xx (single socket Kaby Lake nodes) are back
- 18:25 – tg09x (A100 nodes) in TinyGPU are back
- 18:30 – tg07x (V100 nodes) in TinyGPU are back
- 18:30 – tg06x (RTX2080Ti nodes) in TinyGPU are back
- 18:35 – GPU sessions on Tier3 Jupyterhub are back
- 18:45 – w22xx/w23xx (ICX nodes) in Woody are back
- 18:50 – tg08x (RTX3080 nodes) in TinyGPU are back
- 19:15 – w24xx/w25xx (ICX nodes) in Woody are back
- 19:15 – major parts of Meggie are back
- 19:25 – TinyFat is back
- 20:10 – major parts of Alex are back; there still might be issues with anvme workspace
- 20:30 – w22xx (ICX nodes) in Woody down due to blown fuse
Thursday 2025-11-27
- 07:15 – most testcluster nodes are back in operation
- 08:40 – w22xx (ICX nodes) in Woody back in operation
- 09:00 – NHR Jupyterhub (local + GPU sessions) are back
- 10:00 – Fritz is back
- 10:30 – most parts of Helma are back
Saturday 2025-11-29
- 09:15 – fviz1 (fvis) in Fritz is back in operation
Login nodes may require reboots once everything else has stabilized.
One rack of Helma is still down.
Summary:
- [x] Alex
- [x] Fritz
- [x] Helma (partially)
- [x] NHR-Jupyterhub
- [x] Tier3-Jupyterhub
- [x] Meggie
- [x] LLM APIs
- [x] Testcluster
- [x] TinyGPU
- [x] Woody

