Author: James Stanley
Date:
To: announce
CC: users
Subject: Re: [bitfolk] Emergency reboot of sol.bitfolk.com required

This sort of thing is why I love bitfolk. You are so well on top of
every little problem that occurs.
Keep on keeping on!
James
On Sun, 5 Jan 2014 21:41:12 +0000
Andy Smith <andy@???> wrote:
> Hi,
>
> On Sun, Jan 05, 2014 at 07:52:32PM +0000, Andy Smith wrote:
> > I apologise for the disruption and I hope to be able to give more
> > information later. I will follow up again when all customer VPSes
> > are known to have booted.
>
> All customer VPSes are believed to have booted as of about 2050Z.
> If yours hasn't, please check out its console. Our Nagios thinks
> that everything that was up before is back up again now.
>
> What appears to have happened is that a customer VPS earlier this
> afternoon was rebooted while under extreme memory pressure (it was
> OOM-killing a lot) and the slow shutdown of that appears to have hit
> a race condition in the host kernel which led to the xenwatch
> kernel thread being left in an uninterruptible 'D' state.
>
> In that state, the host was unable to create any further virtual
> network devices, so this customer could not complete their reboot
> nor could they launch the rescue environment. As no network devices
> could be created, no VPS could in fact be started, which is why a
> full reboot of the host was necessary.
>
> This appears to be a known bug and there is probably a fix for it
> that we can apply, but I did not want to do that in the middle of
> this semi-emergency reboot.
>
> This kernel version is in use on several hosts with combined uptimes
> of thousands of days so I do not think this is a commonly-hit bug
> and I would rather take a little bit of time to research the fix and
> then schedule a reboot for kernel upgrade with plenty of notice.
>
> I will keep you informed about that.
>
> Cheers,
> Andy
>