On Sun, Sep 10, 2017 at 01:01:59AM +0000, Andy Smith wrote:
[the second, unclean restart of all VPSes on "elephant"]
I am really sorry that this happened. I could have tested my
proposed action on test hardware but I was sure I had done it before
without incident, and I was wrong.
What happened was:
- One of the patches was not to the hypervisor itself (which is
booted like a kernel) but to one of the daemons that runs in dom0,
that being xenstored. The xenstore is the thing that records which
guests are running on the hypervisor and how they are configured.
The xenstored is dom0's interface to it.
The bug could have allowed a guest administrator to crash the
xenstored in dom0, leaving guests unmanageable (no start/stop).
Therefore there was an updated xen-utils package to install before
reboot.
- I forgot about this and carried out my usual procedure of:
- suspending VMs that opted to be suspended.
- shutting down every other VM.
- rebooting dom0.
- dom0 was already booted and starting VPSes by the time I realised
I had forgotten a step, but I had thought in the past that I had
been able to kill xenstored and start it again without issue, so
once all VPSes had booted I tried that.
- Once I killed xenstored and started it again, various scripts I run
which read things out of the xenstore started reporting that they
couldn't do so. It was then that I realised that xenstored is not
designed to ever be stopped. I had confused my memory with another
daemon (xenconsoled) which is safe to stop and start.
- I spent some time trying various things but ultimately had to
accept that I'd have to reboot dom0 without even being able to
cleanly shut down any of the VPSes.
- Things went quite quickly once I had committed to that action and
as far as I can tell there was no ill-effect from the unclean
shutdown other than nobody on that host getting the
suspend/restore that they might have wanted.
- To add further irritation, the bug in xenstored can actually only
be triggered by a user with an HVM guest, which BitFolk doesn't yet
support, so in fact it would have been safe to not deploy that
particular fix. I hadn't considered that possibility as the fix
was in theory easy to deploy so I'd packaged it up without looking
too deeply into whether it was required (other fixes were required
so no escaping the work).
The main problems here were:
- Forgetting to install a fixed package.
I have a procedure of work documented for these situations, but
they normally involve just booting into a new hypervisor. I didn't
write up a plan for this work which differed in that a package
also needed to be upgraded first, because I thought it was a
simple enough deviation that I would just remember it.
In future I will try to document a plan for each of these
maintenance events, even if that is largely a copy of previous
plans. I think that will reduce the chance of making an error like
this again.
- Forgetting that you can't restart xenstored.
This is not something I am likely to forget again. It would be
worth noting in any maintenance plan that involves xenstored
though.
- Even deploying an unnecessary fix.
Arguably that particular package didn't need to be upgraded on any
BitFolk server, but on balance I think I would still want to do so
as if I'm deploying a security patch then I prefer to have *all*
security bugs patched where possible even if they are not
immediately applicable. I don't want to have to keep a separate
record of what modes of operation are no longer safe due to
cherry-picking of patches.
However, if I could go back in time to the point where I had just
rebooted "elephant" and it was all up and running, and knew that I
had forgotten to upgrade the xenstored package, I think a
different choice would have been correct.
I'd know that I could install the upgraded package and cleanly
shut everything down again, but knowing that it wouldn't actually
be fixing any viable vulnerability I think the correct choice
would just have been to note that "elephant" needs a reboot before
any HVM guests could ever be allowed to run on it.
Chances are that another reboot will happen before HVM happens,
and anyway my plans for HVM don't initially include mixing HVM and
PV guests on the same hardware.
Again, I feel a properly written plan would cover which upgrades
are necessary versus which ones are just nice to have.
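
As a rough illustration of the kind of written plan described above,
here is a minimal Python sketch of a checklist that refuses to reach
the reboot step while an earlier required step is still outstanding.
The Step class, the blockers() function and the wording of the steps
are all hypothetical, made up for this example; this is not BitFolk's
actual tooling, and which steps count as required is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    required: bool      # False means "nice to have" (hypothetical flag)
    done: bool = False


def blockers(plan, target):
    """Return names of required, unfinished steps listed before `target`."""
    names = [step.name for step in plan]
    idx = names.index(target)  # raises ValueError if target is not in the plan
    return [s.name for s in plan[:idx] if s.required and not s.done]


# Hypothetical plan for the maintenance described in this mail.
plan = [
    Step("install updated xen-utils package", required=True),
    Step("suspend VMs that opted to be suspended", required=True),
    Step("shut down every other VM", required=True),
    Step("reboot dom0", required=True),
]

# Simulate the usual procedure being followed with the package step forgotten.
plan[1].done = True
plan[2].done = True
print(blockers(plan, "reboot dom0"))
# prints ['install updated xen-utils package']
```

A step marked required=False would never appear in the list of
blockers, which is one way a plan could record "nice to have"
upgrades separately from necessary ones.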
So, apologies again for the longer than anticipated outage and rude
unclean shutdown that customers on "elephant" experienced this time.
I will try to do better in future.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Please consider the environment before reading this e-mail.
— John Levine