Hello,
On Tue, Sep 08, 2020 at 04:51:52PM +0000, Andy Smith wrote:
Unfortunately some serious security bugs have been discovered in the
Xen hypervisor and fixes for these have now been pre-disclosed, with
an embargo that ends at 1200Z on 22 September 2020.
As a result we will need to apply these fixes and reboot everything
before that time. We are likely to do this in the early hours of the
morning UK time, on 19, 20 and 21 September.
This maintenance work has now been completed, without incident. The
details of the security issues which were fixed will appear at:
https://xenbits.xen.org/xsa/
after 1200Z on 22 September. We also took the opportunity to upgrade
CPU microcode where available.
Thanks for your patience during this disruption.
The rest of this email is some comments about suspend and restore so
if you have no interest in that it's safe to stop reading now.
During the course of this work three VMs were almost-live migrated¹. All
three worked fine.
96 VMs were suspended and restored². 94 of them appeared to cope
fine; 2 failed to restore properly.
One of the failures was a Debian buster VPS which didn't respond to
pings after restore. This was noticed by monitoring and the VPS was
then cleanly shut down and booted, after which it worked. Many
Debian buster VMs were suspended and restored so I do not think this
is a general problem with the kernel in buster but perhaps something
with the particular kernel modules in use in that case.
The other failure was an Ubuntu 16.04 VPS. Unfortunately this did
continue to respond to pings, but every process was hung. This was
not noticed until the customer investigated many hours later and
they had to use the Xen Shell "destroy" command then boot it again.
When customers opt in to suspend and restore we add a ping monitor so we
stand some chance of noticing if the restore should fail, and can
then take action on your behalf. Clearly there are failure modes
where your kernel is able to respond to a ping but some or all
processes don't work properly. It would be a good idea to ask for
additional checks of whatever services you are running.
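
As a rough sketch of what such an additional check might look like
(this is an illustration only, not a description of BitFolk's
monitoring), the Python 3 snippet below connects to a TCP service and
waits for the application itself to send something back. A bare TCP
connect is not enough here, because a kernel that still answers pings
will usually also complete the TCP handshake for a listening socket
even when the process behind it is hung. The function name
service_answers, its parameters and the host name in the comments are
all hypothetical.

```python
import socket


def service_answers(host, port, probe=b"", expect=b"", timeout=5.0):
    """Return True only if the service at host:port sends back data.

    probe  -- optional bytes to send first (e.g. a minimal HTTP request)
    expect -- optional short prefix the reply must start with
    A successful connect with no reply before the timeout counts as a
    failure, which is the case a ping check cannot see.
    """
    try:
        # The timeout applies to the connect and to the later send/recv.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if probe:
                sock.sendall(probe)
            reply = sock.recv(256)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
    return len(reply) > 0 and reply.startswith(expect)


# Hypothetical usage (host name is a placeholder):
#   SSH servers send a banner as soon as the connection is accepted:
#     service_answers("vps.example.com", 22, expect=b"SSH-")
#   HTTP servers reply once they have read a request:
#     service_answers("vps.example.com", 80,
#                     probe=b"HEAD / HTTP/1.0\r\n\r\n", expect=b"HTTP/")
```

A check like this only shows that one particular daemon is still
scheduling and answering; a process that is hung elsewhere would need
its own probe.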
We're not really in a position to actively debug suspend and restore
problems aside from recommending that as new a kernel as
possible/convenient is used. We can certainly provide information if
any of you want to open a bug report with your Linux distribution or
the upstream Linux kernel.
You can learn more about suspend and restore here:
https://tools.bitfolk.com/wiki/Suspend_and_restore
Cheers,
Andy
¹ This involves syncing the storage and a dump of the memory image
between servers, so typically involves a pause in execution of
30–60 seconds. It is still experimental so we won't do it unless
you specifically ask and we have patched destination hardware
available.
² Memory dumped to storage, restored again after the bare metal host
is rebooted. Typically involves a pause in execution of 10–20
minutes. We will use this method if you opt in to it from:
https://panel.bitfolk.com/account/config/
--
https://bitfolk.com/ -- No-nonsense VPS hosting