Emergency shutdown of all Woody Ice Lake w2xxx nodes


A large water pipe burst at around 12:30 in the basement of the building that houses the w2xxx nodes. As a result, the cooling water that used to be in the pipes formed a pretty little swimming pool in the basement, and the cooling for the w2xxx nodes (i.e. the nodes with 32 Intel Ice Lake cores per node) failed completely.

To avoid hardware damage, we had to shut down all affected nodes immediately. All jobs running on these nodes at that time will have failed hard: there was no time for a proper shutdown or for waiting for jobs to finish. We apologize for the inconvenience.

For now, all batch processing on Woody (except on the single-socket w13xx, w14xx, and w15xx nodes, which are housed elsewhere) is suspended, and it may remain so for some time. We do not yet know how extensive the damage is or how long repairs will take. We will keep this post updated.

Update 2024-06-18 14:00: Repairs will likely finish today; refilling of the cooling circuit is planned for tomorrow. With a bit of luck, the w2xxx nodes will be back by tomorrow evening.

Update 2024-06-19 10:00: Cooling is back. Batch processing will be resumed.

The cause of the unplanned shutdown: a burst water pipe in the basement. Note that by the time this image was taken, most of the water was already gone. Image credit: Kay Graf, ECAP.