Hi,
At about 20:37Z we started receiving alerts for customer services on
server "hobgoblin". It quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.
All I could do was power cycle the server, which I did at about
20:58Z.
We're still unable to reproduce the problem on demand, and it can be
several months between incidents. We've tried upgrading the
hypervisor and that hasn't helped. It's looking more like a problem
in the Linux kernel, so I upgraded that as well to a newer
self-built package.
I've been communicating with a couple of the linux-raid developers
and we have some ideas, but gathering information and making changes
is going slowly because of the lack of reproducibility and the long
time between incidents. It's basically a case of making a single
change each time there is an incident.
With the upgrades done, the server was rebooted once more, and at
about 21:19Z customer VMs started booting again. This was complete
by about 21:33Z.
Obviously I am not happy with these outages and I'm doing everything
I can to find the root cause.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At around 00:33 BST (23:33Z) we started to receive alerts regarding
services on host "clockwork". Upon investigation the host was
showing all the signs of the intermittent "frozen I/O" problem we've
been having:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
As mentioned in that earlier email, I'd decided that the next step
would be to prepare new hypervisor packages, and I did that the next
day.
As this issue seems to happen only every few months and on
different servers, we do not yet know whether the new packages fix
the problem.
They've been in use on other servers since late April without
incident, but that isn't yet proof enough given the long periods
between occurrences.
Anyway, after "clockwork" was power cycled, the new packages were
installed there and then all VMs were started again. This was
completed by about 01:16 BST (00:16Z).
There are still many of our servers where we know this is going to
happen again at some point. I don't feel comfortable scheduling
maintenance to upgrade them when I don't know whether the upgrade
will be effective. If we can go a significant period of time on the
newer version without incident then we will schedule a maintenance
window to get the remaining servers onto it too. It is also possible
that a security patch will force a maintenance window, in which case
we'll upgrade the hypervisor packages to the newer version at the
same time.
There are also some servers, "hen" and "paradox", still left to be
emptied so that their operating systems can be upgraded. Once they
are emptied and upgraded they will of course be put on the newer
version of the hypervisor. Customer services moved from those
servers are expected to be placed on servers that already have the
newer hypervisor version.
Thank you for your patience and I apologise for the disruption. I'm
doing all that I can to try to find a solution.
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting