Scheduled downtime of Fritz and Helma clusters from October 06 – 15


There will be a long scheduled downtime of the Fritz and Helma clusters,

  • starting on Monday, October 06,
  • and lasting at least until Wednesday, October 15 (originally planned: Friday, October 10).

While the general expectation is that the clusters will be back by Friday night, the downtime might have to be extended over the weekend.

The reason for the downtime is the (massively delayed) installation of new transformers for the building housing Fritz and Helma.

As usual, jobs that would collide with the downtime will automatically be postponed until after the downtime.
The frontends and parallel filesystems (/lustre, /hnvme) of these clusters should stay available during the downtime. However, with electricians literally ripping out the entire power supply going into the building, let’s just say there is a certain risk that things might not go entirely to plan.
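
As a rough illustration of what "colliding with the downtime" means for a job, here is a minimal sketch in Python. It is not the batch system's actual implementation. The function names are hypothetical, and the times of day for the start and end of the window are placeholders, since only the dates are given in this post.

```python
from datetime import datetime, timedelta

# Assumed downtime window. The dates come from this announcement;
# the times of day are placeholders and not stated anywhere.
DOWNTIME_START = datetime(2025, 10, 6, 0, 0)
DOWNTIME_END = datetime(2025, 10, 15, 23, 59)


def collides_with_downtime(start: datetime, walltime: timedelta) -> bool:
    """True if a job starting at `start` with the requested walltime
    would overlap the downtime window at any point."""
    end = start + walltime
    return start < DOWNTIME_END and end > DOWNTIME_START


def earliest_start(requested_start: datetime, walltime: timedelta) -> datetime:
    """Postpone a colliding job until the end of the downtime window;
    leave all other jobs at their requested start time."""
    if collides_with_downtime(requested_start, walltime):
        return DOWNTIME_END
    return requested_start


if __name__ == "__main__":
    sunday_noon = datetime(2025, 10, 5, 12, 0)
    # A 24 h job would still be running on Monday, so it is postponed.
    print(earliest_start(sunday_noon, timedelta(hours=24)))
    # A 6 h job finishes on Sunday evening and can start right away.
    print(earliest_start(sunday_noon, timedelta(hours=6)))
```

In practice this means that requesting a shorter walltime can allow a job to still run before the downtime begins, provided it fits entirely into the remaining time.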

We will keep this post updated.

Update 2025-10-06@16:30: There was just an outage of some Fritz frontends and /lustre due to a blown fuse. Unfortunately, while the work is ongoing, there is no redundancy, so a single blown fuse takes out quite a few things. The Fritz frontends and /lustre should be back up by now.

Update 2025-10-10: The downtime has to be extended until at least Tuesday (Oct. 14).

Update 2025-10-13@17:00: three of the six Helma login nodes are currently unavailable, after unexpected problems with a routine update.

Update 2025-10-13@17:10: all Helma compute nodes will be fully reinstalled

Update 2025-10-13@17:30: early test jobs on Helma may have failed with “Unable to locate a modulefile for …”

Update 2025-10-13@18:00: batch operation started on 1/4 of Helma

Update 2025-10-13@18:15: there seem to be issues with /hnvme on Helma

Update 2025-10-14@08:00: batch processing on Helma has been suspended again

Update 2025-10-14@14:00: batch processing on large parts of Helma has been resumed.

Update 2025-10-14@16:10: Power to the building is finally restored at full capacity. We will now start work on bringing back Fritz and the rest of Helma, but this will likely NOT be finished today.

Update 2025-10-14@17:00: two of the six Helma login nodes are still unavailable.

Update 2025-10-15@09:45: all Helma login nodes are now back. So with the exception of a small handful of nodes, Helma is now back in regular operation.

Update 2025-10-15@11:45: regular batch processing has been resumed on Fritz.

Update 2025-10-15@12:15: apart from a small amount of cleanup, the maintenance work is now done.

Update 2025-10-20@09:45: the login node helma1 has been missing user accounts since 2025-10-13.