2021-06-19 ~20:37Z - 2021-06-19 ~21:3Z: Emergency reboot of server "hobgoblin" - BitFolk Announcements

19 Jun 2021

Hi,

At about 20:37Z we started receiving alerts for customer services on
server "hobgoblin". It quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.

All I could do was power cycle the server, which happened at about
20:58Z.

We're still not able to reproduce the problem on demand and it can
be several months between incidents. We've tried upgrading
hypervisor and that's not helped. It's looking more like a problem
in the Linux kernel. So, I upgraded that as well to a newer
self-made package.

I've been communicating with a couple of the linux-raid devs and we
have some ideas but gathering information and making changes is
going slowly because of the lack of reproducibility and long time
between incidents. It's basically a case of making a single change
any time there is an issue.

With the upgrades done, the server was rebooted again and at about
21:19Z customer VMs started booting again. This was complete by
about 21:33Z.

Obviously I am not happy with these outages and I'm doing everything
I can to find the root cause.

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting