Unscheduled downtime: Alex, Fritz, Helma and Woody on April 30

On the afternoon of April 29, a severe security vulnerability named ‘copy.fail’, affecting all Linux kernels since 2017, became public.

We hastily deployed a mitigation on our Ubuntu-based systems, but unfortunately no such mitigation was possible on our AlmaLinux-based systems. As a result, we had to block access to our AlmaLinux-based clusters, namely Alex, Fritz, Helma and Woody, in the middle of the night. The Ubuntu-based systems TinyGPU and TinyFat continued to operate normally.

We have since built a working mitigation for the AlmaLinux-based systems. We are now distributing it and resuming normal cluster operation. This will, however, take some time.

Update 30.04. 15:10: the frontends of Helma are reachable again, and batch processing on Helma has been partially resumed.

Update 30.04. 16:00: the frontends of Fritz are reachable again, and batch processing on Fritz has been partially resumed.

Update 30.04. 18:30: the frontends of Woody are reachable again, and batch processing on Woody has been partially resumed.

Update 30.04. 18:45: the frontends of Alex are reachable again, and batch processing on Alex has been partially resumed.

Update 30.04. 18:45: that’s it for today. Large parts of the clusters are online again; the remaining cleanup will have to happen after the extended weekend.

Update 04.05. 15:00: we’re mostly done with cleanup. There will be another reboot of all machines (including the Ubuntu-based ones) once proper updates are released, but for now things should be back to normal.