[bitfolk] 2025-02-02 ~00:03Z - Emergency reboot of host talisker

3 Feb 2025

Hi,

At approximately 00:03Z we started receiving alerts of various services
not responding and it was determined that host talisker was having some
problems with its storage.

There were lots of errors being spewed into the kernel log from the SAS
controller's driver, mostly of a timeout variety, and none of the drives
attached to it were responding. A number of its MD RAID arrays fell
apart as a result and IO errors would have been seen inside your virtual
machines.

I did try a few things around resetting the controller but nothing
worked, so at around 00:25 I had to forcibly kill all running VPSes and
reboot the host, which happened at about 00:29.

The host talisker booted without incident and all its RAID arrays synced
up. By around 00:39 all customer VPSes should have booted, and all those
we have monitoring for did show as up by then.

Due to abruptly losing access to storage, some data in memory will have
been lost, but hopefully apps are aware of that. I do not think any
reads or writes were corrupted so I don't think there should be any
filesystem corruption. If you are seeing any problems and your VPS is
actually on talisker then you should first have a look at your Xen
Shell consoles.

Apologies for the disruption. We will keep an eye on talisker to gain
some assurance that this was a one-off event.

Thanks,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting
