Systems down due to power glitch

Today (November 26) at around 4:30 p.m., a short glitch on the city power lines led to most HPC systems going down: Fritz, Alex, Woody, Meggie, TinyGPU, TinyFat, the test cluster, and parts of Helma. Frontends are still up, but most jobs were killed. Admins are working to get the systems up and running again.

Updates will be posted here.

    • 18:00 – memoryhog is back
    • 18:05 – local sessions on Tier3 Jupyterhub are back
    • 18:05 – Fritz & Helma won’t be back today due to a failure of the building’s central cooling infrastructure; ZUV/G is working on it.
      NHR Jupyterhub is down as it’s running on a Fritz node
    • 18:20 – w14xx (single socket Kaby Lake nodes) are back
    • 18:25 – tg09x (A100 nodes) in TinyGPU are back
    • 18:30 – tg07x (V100 nodes) in TinyGPU are back
    • 18:30 – tg06x (RTX2080Ti nodes) in TinyGPU are back
    • 18:35 – GPU sessions on Tier3 Jupyterhub are back
    • 18:45 – w22xx/w23xx (ICX nodes) in Woody are back
    • 18:50 – tg08x (RTX3080 nodes) in TinyGPU are back
    • 19:15 – w24xx/w25xx (ICX nodes) in Woody are back
    • 19:15 – major parts of Meggie are back
    • 19:25 – TinyFat is back
    • 20:10 – major parts of Alex are back; there still might be issues with anvme workspace
    • 20:30 – w22xx (ICX nodes) in Woody are down due to a blown fuse
    • Nov 27, 07:15 – most test cluster nodes are back in operation
  • Login nodes may require reboots once everything else has stabilized

Summary:

  • [x] Alex
  • [ ] Fritz
  • [ ] Helma
  • [ ] NHR-Jupyterhub
  • [x] Tier3-Jupyterhub
  • [x] Meggie
  • [ ] LLM APIs
  • [x] Testcluster
  • [x] TinyGPU
  • [x] Woody (except w22xx, which are down due to a blown fuse)