Hi,
After receiving a number of alerts for VMs hosted on server "jack",
I investigated and found the server largely unresponsive.
Unfortunately I had no option but to forcibly reboot it, which I did
at about 06:47Z.
It's now 07:01Z and monitoring says everything is back up, except
for two customer VMs which are waiting for a LUKS passphrase on
their console.
This is the same problem that affected some of the other servers a
few months ago. With the months-long gap I had hoped it was some
undiagnosed kernel issue which we had got past, but apparently not,
as "jack" is on the latest available kernel package.
I'm pursuing an idea for a config change that may help, and I
managed to put it in place before "jack" was rebooted. It requires a
reboot to take effect, so if it does help, the other servers won't
benefit until their next reboot. It does no harm in the meantime, so
I've made the same change on them as well.
If that doesn't fix things then the next line of investigation will
be an upgrade of the hypervisor to the latest stable release, though
that is a rather major undertaking.
Apologies for the disruption. It is challenging to debug a problem
that can take several months to occur, with no reliable way of
triggering it. :(
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting