BookStack CyberStorm Website Keycloak LibreNMS Netbox Network Builder Proxmox Wazuh Website VPN HSPC Megafonzie
A hard drive in the RAID array backing CNS infrastructure began behaving abnormally, degrading performance on all CNS services to unusable levels. The array was rebuilt onto a hot spare, and the failed drive was then replaced.
All times are in Eastern Daylight Time (UTC-04:00).
zfs1, our infrastructure storage server, reboots. The cause of the reboot is unknown, but believed to be the result of temporary power loss due to the weather. At this time, the hard drive in bay 3 fails.
delphox and beheeyem, the CNS domain controllers.
zfs1 is rebooted in an attempt to start the RAID rebuild process using a hot spare.
zfs1 is rebooted again.
zfs1 is started into the operating system.
The sudden power loss caused one of the disks in the RAID array to exhibit SMART errors and behave erratically. The degraded performance made the entire array unusable, resulting in all the services becoming unavailable.
The array uses RAID 5, which can rebuild from a single-drive failure. It was rebuilt using a hot spare already installed in the array. Afterwards, the failed drive was replaced with a working one so that it could not be used further.
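To illustrate why a RAID 5 array can survive a single-drive failure, here is a minimal sketch of XOR parity reconstruction. It is a simplified illustration, not a description of how our array or its controller is implemented: the block contents, the `xor_blocks` helper, and the fixed parity position are all hypothetical, and real RAID 5 rotates parity across the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks on three drives, parity on a fourth.
data_blocks = [b"CNS1", b"CNS2", b"CNS3"]
parity = xor_blocks(data_blocks)

# Simulate losing the drive that holds the second data block.
lost_index = 1
survivors = [blk for i, blk in enumerate(data_blocks) if i != lost_index]

# XOR-ing the surviving data blocks with the parity block yields the lost block.
rebuilt = xor_blocks(survivors + [parity])
```

Because the parity block is the XOR of all data blocks in the stripe, any one missing block can be recomputed from the rest. Losing a second drive before the rebuild finishes leaves too little information to recover, which is why a degraded array is at risk until the rebuild completes.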
This incident served as a reminder of the instability of our storage backend, which relies on a single server and storage array to power all CNS operations. We will be investigating ways to improve its reliability.
Additionally, this incident highlighted the need for better backups of our core infrastructure. While we were able to restore BookStack from a backup, most of our other infrastructure, including our domain configuration and network inventory, is not backed up offsite. Had the recovery failed, we stood to lose important data. Going forward, we will focus on improving our disaster recovery procedures, including taking regular backups.