Over quota on $HOME and $HPCVAULT file systems

2024-01-19

Between about 15:00 on January 18 and about 09:45 on January 19, some users may have experienced problems with job execution on all clusters because they were wrongly reported as having exceeded their quotas on the /home/hpc and/or /home/vault file systems. For /home/hpc, the problem was fixed at around 20:00 on January 18; for /home/vault, it took longer to fix.

This unfortunate incident was an unexpected consequence of a temporary safety measure we put in place yesterday. Currently, all data on /home/hpc and /home/vault is replicated to two different disc arrays. Because of the way this is implemented, everything you store is at the moment counted towards your quota usage twice. For example, if you store 1 GB of data on /home/vault, it currently uses 2 GB of your quota. To accommodate this, we have temporarily doubled all quotas.
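
To illustrate the arithmetic, here is a minimal Python sketch of how the double counting interacts with the quota limit. The function names and the example numbers (a 500 GB quota, 300 GB of stored data) are hypothetical and chosen only for illustration; the sketch does not describe how the file system itself does its accounting.

```python
# Assumption for illustration: every stored byte is counted once per replica.
REPLICATION_FACTOR = 2


def accounted_usage_gb(stored_gb: float, replicas: int = REPLICATION_FACTOR) -> float:
    """Quota usage that is reported for `stored_gb` of actual data."""
    return stored_gb * replicas


def is_over_quota(stored_gb: float, quota_gb: float,
                  replicas: int = REPLICATION_FACTOR) -> bool:
    """True if the accounted usage exceeds the quota limit."""
    return accounted_usage_gb(stored_gb, replicas) > quota_gb


original_quota_gb = 500.0  # hypothetical quota before the change
stored_gb = 300.0          # hypothetical amount of data actually stored

# With replication on and the old quota, the user appears to be over quota.
print(is_over_quota(stored_gb, original_quota_gb))      # True

# With the quota doubled, the same data fits again.
print(is_over_quota(stored_gb, 2 * original_quota_gb))  # False
```

In other words, doubling the quota cancels the doubled accounting, so the amount of data you can actually store stays the same as before.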

In the unlikely case that you experienced data loss, for example because you tried to update a file on /home/vault and it was truncated because you had exceeded your quota, please use the snapshot feature of the file system as described here: https://hpc.fau.de/systems-services/documentation-instructions/hpc-storage/#snapshots

For those interested in the details: we turned on replication yesterday because the hardware manufacturer advised us to expect an excessively high (mechanical) failure rate for some of the hard disc drives in our system, possibly high enough to cause failure of a RAID array despite RAID6. While we do have backups, we would prefer not to have to use them and to avoid the downtime that comes with using them. We are therefore temporarily running software RAID1 over RAID6 until about 100 hard disc drives have been replaced.

In case of problems, please contact hpc-support@fau.de.

We apologize for the inconvenience.