Write-up of the problems with the maintenance for "elephant" on 2017-09-10

14 Sep 2017

On Sun, Sep 10, 2017 at 01:01:59AM +0000, Andy Smith wrote:

[the second, unclean restart of all VPSes on "elephant"]

...
 I am really sorry that this happened. I could have tested my
 proposed action on test hardware but I was sure I had done it before
 without incident, and I was wrong.

What happened was:

- One of the patches was not to the hypervisor itself (which is
  booted like a kernel) but to one of the daemons that runs in dom0,
  that being xenstored. The xenstore is the thing that records which
  guests are running on the hypervisor and how they are configured.
  The xenstored is dom0's interface to it.

  The bug could have allowed a guest administrator to crash the
  xenstored in dom0, leaving guests unmanageable (no start/stop).
  Therefore there was an updated xen-utils package to install before
  reboot.

- I forgot about this and carried out my usual procedure of:

  - suspending VMs that opted to be suspended.

  - shutting down every other VM.

  - rebooting dom0.

- dom0 was already booted and starting VPSes by the time I realised
  I had forgotten a step, but I had thought in the past that I had
  been able to kill xenstored and start it again without issue, so
  once all VPSes had booted I tried that.

- Once I killed xenstored and started it again, various scripts I run
  which read things out of the xenstore started reporting that they
  couldn't do so. It was then that I realised that xenstored is not
  designed to ever be stopped. I had confused it in my memory with
  another daemon (xenconsoled) which is safe to stop and start.

- I spent some time trying various things but ultimately had to
  accept that I'd have to reboot dom0 without even being able to
  cleanly shut down any of the VPSes.

- Things went quite quickly once I had committed to that action and
  as far as I can tell there was no ill-effect from the unclean
  shutdown other than nobody on that host getting the
  suspend/restore that they might have wanted.

- To add further irritation, the bug in xenstored can actually only
  be triggered by a user with a HVM guest, which BitFolk doesn't yet
  support, so in fact it would have been safe to not deploy that
  particular fix. I hadn't considered that possibility as the fix
  was in theory easy to deploy so I'd packaged it up without looking
  too deeply into whether it was required (other fixes were required
  so no escaping the work).

The main problems here were:

- Forgetting to install a fixed package.

  I have a procedure of work documented for these situations, but
  they normally involve just booting into a new hypervisor. I didn't
  write up a plan for this work which differed in that a package
  also needed to be upgraded first, because I thought it was a
  simple enough deviation that I would just remember it.

  In future I will try to document a plan for each of these
  maintenance events, even if that is largely a copy of previous
  plans. I think that will reduce the chance of making an error like
  this again.

- Forgetting that you can't restart xenstored.

  This is not something I am likely to forget again. It would be
  worth noting in any maintenance plan that involves xenstored
  though.

- Even deploying an unnecessary fix.

  Arguably that particular package didn't need to be upgraded on any
  BitFolk server, but on balance I think I would still want to do so
  as if I'm deploying a security patch then I prefer to have *all*
  security bugs patched where possible even if they are not
  immediately applicable. I don't want to have to keep a separate
  record of what modes of operation are no longer safe due to
  cherry-picking of patches.

  However, if I could go back in time to the point where I had just
  rebooted "elephant" and it was all up and running, and knew that I
  had forgotten to upgrade the xenstored package, I think a
  different choice would have been correct.

  I'd know that I could install the upgraded package and cleanly
  shut everything down again, but knowing that it wouldn't actually
  be fixing any viable vulnerability I think the correct choice
  would just have been to note that "elephant" needs a reboot before
  any HVM guests could ever be allowed to run on it.

  Chances are that another reboot will happen before HVM happens,
  and anyway my plans for HVM don't initially include mixing HVM and
  PV guests on the same hardware.

  Again, I feel a properly written plan would cover which upgrades
  are necessary versus which ones are just nice to have.
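
As an illustration only, a written plan of this kind could be as
simple as an ordered checklist that marks each step as required or
nice to have, and reports what is still outstanding before a given
step. This is a minimal sketch, not BitFolk's actual tooling: the
Step class, the blocking_steps function and the step wording are all
hypothetical, and the xen-utils step is marked as required here purely
to show the check working.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One line of a maintenance plan (hypothetical structure)."""
    description: str
    required: bool = True   # False means "nice to have"
    done: bool = False


def blocking_steps(plan, before):
    """Return required, unfinished steps listed earlier than `before`."""
    blocking = []
    for step in plan:
        if step.description == before:
            return blocking
        if step.required and not step.done:
            blocking.append(step.description)
    raise ValueError(f"no step named {before!r} in plan")


# Assumed plan for this maintenance; the order follows the write-up.
plan = [
    Step("install upgraded xen-utils package"),
    Step("suspend VMs that opted to be suspended"),
    Step("shut down every other VM"),
    Step("reboot dom0"),
]

# Anything printed here must be dealt with before rebooting dom0.
print(blocking_steps(plan, "reboot dom0"))
```

Setting required=False on a step keeps it visible in the plan without
letting it block the reboot, which is one way to record the difference
between necessary upgrades and nice-to-have ones.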

So, apologies again for the longer than anticipated outage and rude
unclean shutdown that customers on "elephant" experienced this time.
I will try to do better in future.

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting

Please consider the environment before reading this e-mail.
 — John Levine
