All clusters down due to power outage – SOLVED
Today at about 10:45 a.m., a widespread power outage in Erlangen brought all our clusters down. All running jobs were terminated. Some frontends were switched off manually to reduce the load on the uninterruptible power supply infrastructure.
We are working to get the systems running again. It is currently unclear when they will be available. We apologize for the inconvenience.
- 12:50 – batch processing on TinyFAT resumed
- 12:55 – batch processing on single-socket nodes of Woody (w14xx) resumed
- 13:15/13:25 – batch processing on (parts of) TinyGPU resumed
- 13:30 – batch processing on Meggie resumed
- 13:35 – Tier3 Jupyterhub resumed
- 15:40 – Alex login nodes available again (delayed due to Lustre storage)
- 15:45 – Fritz login nodes available again (delayed due to Lustre storage)
- 15:55 – batch processing on parts of Alex + Fritz compute nodes resumed (delayed due to Lustre storage)
- 16:00 – NHR Jupyterhub resumed
- 16:15 – Alex fully available again (delayed due to Lustre storage)
- 16:55 – Fritz fully available again (delayed due to Lustre storage)
- 17:45 – batch processing on dual-socket nodes (w2xxx) of Woody resumed (delayed due to outage of cooling infrastructure)