Unscheduled downtime of most of our systems from Friday, Oct. 10 until Oct. 16


Due to complications encountered during maintenance work on the cooling infrastructure, the cooling system is currently operating at severely reduced capacity.

To ensure cooling availability for critical infrastructure, we have to shut down most of the clusters. Jobs that are already running will probably be able to finish, but no new jobs will be started.

Affected clusters are:

  • parts of Alex
  • Meggie
  • large parts of TinyGPU (the RTX2080Ti nodes remain available)
  • TinyFat
  • Testcluster
  • small parts of Woody

There is more bad news: The planned repair date is currently Thursday, October 16!

Moreover, Fritz & Helma are also down due to scheduled infrastructure work, cf. https://hpc.fau.de/2025/10/06/scheduled-downtime-of-fritz-and-helma-clusters-from-october-06/

Update 2025-10-11: 1/4 of Alex is back

Update 2025-10-14: 1/2 of Alex is back

Update 2025-10-16 15:30: Cooling is back at full capacity. We will now power up the clusters.

Update 2025-10-16 15:45: Alex and Woody are back.

Update 2025-10-16 17:00: Meggie, parts of TinyFat (all but the 256 GB Broadwell nodes), parts of TinyGPU (only the RTX3080, A100, and some Jupyter nodes; the RTX2080Ti and V100 nodes are not yet available), and Testcluster are back.

Update 2025-10-17 08:15: RTX2080Ti nodes in TinyGPU are back

Update 2025-10-17 18:00: V100 nodes in TinyGPU and Alex are back

Update 2025-10-19 15:00: 256 GB Broadwell nodes of TinyFat are back