evolutionchamber.org outage
Didn't write down when it started, but it was down for almost 2 weeks.
Full outage of all important systems due to an issue with the Ceph cluster.
One of the two OSDs failed and this caused all volumes to fail to mount. Curiously, all data is mirrored between the two (so no data loss), but having one OSD down still brought everything down. I've had an OSD fail before and don't recall it doing this. One thing I've noted is that my .mgr pool is set to 3x replication even though I've only ever had two OSDs up at a time; not entirely sure how it ended up at 3x. So my guess is that, with the pool down to one available replica, it refused to service writes, and those writes are required to mount volumes. Haven't found any Ceph documentation that backs up this usage pattern of .mgr or the hypothesis.
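For reference, a minimal sketch of how I'd check this, assuming the standard ceph CLI on an admin/mon node. If the .mgr pool really is size 3 with the default min_size of 2, dropping to a single OSD would push its placement groups below min_size, and Ceph blocks I/O to PGs in that state, which would line up with the behaviour above.

```
# List every pool with its replication size and min_size
ceph osd pool ls detail

# Or query the .mgr pool (the built-in manager pool) directly
ceph osd pool get .mgr size
ceph osd pool get .mgr min_size

# PGs with fewer than min_size replicas available show up as
# undersized/inactive here, which is what blocks I/O
ceph health detail
ceph pg stat
```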
Onboarded a new RPi5 variant to the cluster that uses two NVMe drives, one for root and one for storage (will be making a separate post for that).
Most of why it took so long to resolve was waiting for parts to arrive.
Once the new node was onboarded to the cluster, I was able to bring the second NVMe in as a new OSD. Then I removed the failing OSD from the cluster, Ceph replicated the data from the working OSD to the new one as expected, all my volumes mounted, and everything came back online.
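For posterity, the swap followed the usual replace-an-OSD flow. A sketch of it, assuming a cephadm-managed cluster; the hostname rpi5-node, device /dev/nvme1n1, and OSD id 0 below are placeholders rather than the actual values from my cluster.

```
# Confirm the new node's second NVMe is visible and unclaimed
ceph orch device ls

# Create an OSD on it (host and device are placeholders)
ceph orch daemon add osd rpi5-node:/dev/nvme1n1

# Mark the failing OSD out so data replicates off it,
# then watch recovery until the cluster is healthy again
ceph osd out 0
ceph -w

# Once it's safe, remove the old OSD entirely
ceph osd safe-to-destroy osd.0
ceph osd purge 0 --yes-i-really-mean-it
```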
All in all, I'm happy with Ceph in that it handled replacing the OSD without much fuss. However, the outage did expose some issues with the hyperconverged nature of the cluster. With auth down, it became a bit of a chore to get Terraform to run, since it kept trying to refresh application integration components on the auth provider. Thankfully, I had not yet committed to pushing Terraform state into Ceph-managed object storage, so it was still local. I know I don't want to keep it local, but I'll need a better spot for it. I think this also points me toward running an extra OSD and putting the more critical components on 3x replication.
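If I do end up running a third OSD, bumping the more critical pools to 3x is just a per-pool setting; the pool name below is a placeholder.

```
# Raise replication on a critical pool once a third OSD exists
# ("kubernetes-rbd" is an example name, not my actual pool)
ceph osd pool set kubernetes-rbd size 3

# Keep min_size at 2 so losing a single OSD doesn't block I/O
ceph osd pool set kubernetes-rbd min_size 2
```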