Maintenance of HPC systems on March 11 and March 20

2024-03-19

There will be two scheduled downtimes of our HPC systems:

Monday, March 11, from 06:50 to 16:00, affecting ONLY Fritz
Wednesday, March 20, from 00:00 to end of the day, affecting ALL our clusters including frontends and fileservers

For the March 20 downtime: The reason is general maintenance work, including but not limited to work on some central fileservers and network infrastructure, which is why you will be unable to log in or access files for at least part of the maintenance period. As usual, jobs that would collide with the downtime will automatically be postponed until after the downtime. And when we say this is going to take the whole day, we mean it – there is little chance of this being finished before the late afternoon.

Update 20.03. 20:00: we’re done with almost everything except some finishing touches. Batch processing will be resumed tomorrow morning.
Update 21.03. 07:30: unfortunately, a badly timed disk failure tonight will delay things. New estimate is “batch processing will be resumed around noonish”.
Update 21.03. 10:30: all frontends and filesystems should already be available again, just processing of batch jobs has not resumed yet.
Update 21.03. 11:45: batch processing on Alex and Fritz has been resumed.
Update 21.03. 12:45: batch processing on Woody has been resumed.
Update 21.03. 14:30: batch processing on TinyFat and TinyGPU has been resumed.
Update 21.03. 16:00: batch processing on Meggie has been resumed.
Update 21.03. 16:00: that’s it, maintenance has finished.

For the March 11 downtime from 06:50 to 16:00, affecting ONLY Fritz: The reason is work on the electricity infrastructure in the building. Only the Fritz compute nodes are affected. The Fritz frontends will stay available because they are on a UPS, and all other clusters will be operating normally. Jobs on Fritz that would collide with the downtime will automatically be postponed until after the downtime. The end time of 16:00 is more of a worst-case estimate – work should normally be finished around lunch time.
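
To estimate whether one of your own jobs would be postponed, the following minimal sketch checks a job against the Fritz downtime window stated above. It assumes that "collide" means the job's start time plus its requested walltime overlaps the window; the actual scheduler logic may differ. The function names are hypothetical and not part of any cluster tooling.

```python
from datetime import datetime, timedelta

# Downtime window for Fritz as announced above (local time).
DOWNTIME_START = datetime(2024, 3, 11, 6, 50)
DOWNTIME_END = datetime(2024, 3, 11, 16, 0)


def collides_with_downtime(start, walltime,
                           window_start=DOWNTIME_START,
                           window_end=DOWNTIME_END):
    """Return True if a job starting at `start` with the requested
    `walltime` overlaps the downtime window (assumed collision rule)."""
    job_end = start + walltime
    return start < window_end and job_end > window_start


def latest_safe_walltime(now, window_start=DOWNTIME_START):
    """Longest walltime a job starting at `now` could request and still
    finish before the window opens (zero if the window has begun)."""
    return max(window_start - now, timedelta(0))


if __name__ == "__main__":
    submit = datetime(2024, 3, 10, 20, 0)
    print(collides_with_downtime(submit, timedelta(hours=12)))
    print(latest_safe_walltime(submit))
```

Under this assumption, requesting a shorter walltime is the only way for a job to still start before the downtime.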

Update: Contrary to plan, the frontends of Fritz and Alex were also (partially) affected, as the InfiniBand switch for Lustre was not on a UPS.
Update 14:45: Work is finished, we’re in the process of powering things back on …
Update 18:30: Batch processing has been resumed. Bringing all nodes online again took much longer than expected due to buggy firmware, recabling of the InfiniBand switches for Lustre (they now ARE on UPS), and some more unexpected obstacles.

We will update this post when maintenance finishes.