Hi,
In the last couple of hours I unfortunately had to reboot host
"talisker" including full shutdown & boot of all the VMs on it.
It seems from logs that problems started at approximately 01:00. The
first alerts came in at 01:22 when customers started trying to
reboot their VMs. Symptoms for customers were stalled tasks, VMs
unable to shut down properly, and VMs unable to boot again after
being forcibly shut down.
I spent some time trying to investigate but it wasn't making things
any better so by about 02:30 I decided to issue a reboot. Customer
VMs were all back up and running by about 02:45.
I am continuing to investigate the root cause and am keeping a
close eye on things.
Apologies for the disruption this will have caused you.
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
==TL;DR version==
You can now perform a mostly-automated install of CentOS 8.x from
our Xen Shell:
https://tools.bitfolk.com/wiki/Using_the_self-serve_net_installer
xen shell> install centos_8
==Full version==
Installing CentOS 8 at BitFolk has previously only been possible by
booting the Rescue VM and doing it in a chroot:
https://tools.bitfolk.com/wiki/Installing_CentOS_8
This is because, as of CentOS 8, Red Hat decided to disable support
for PV and PVH mode Xen guests in all of their kernels, even though
the upstream Linux kernel enables that support by default.
Thanks to some work by Jon Fautley¹, who hacked together a modified
installer kernel and initrd for CentOS and RHEL, we were able to
boot the installer anyway, so a more normal install experience is
now possible.
It is still necessary for CentOS users to switch to the kernel-ml
kernel package from ElRepo, so our installer does that for you.
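For reference, the manual steps the installer automates look
roughly like this (a sketch based on ElRepo's published
instructions; verify the release package URL against their site
before running anything):

```shell
# Import ElRepo's signing key and install the repo release package
# for CentOS 8 (URL per ElRepo's documentation; check it first)
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf -y install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm

# Install the mainline kernel from the elrepo-kernel repository
dnf -y --enablerepo=elrepo-kernel install kernel-ml

# Make the newly installed kernel the default boot entry
grub2-set-default 0
```

After a reboot you should be running the kernel-ml mainline kernel,
which retains the PV/PVH Xen guest support that the stock CentOS 8
kernel drops.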
===But isn't CentOS 8 dead?===
Red Hat recently moved the EOL date for CentOS 8 forward from 2029
to 31 December 2021. After that point, existing CentOS 8 users would
need to switch to CentOS Stream or some other distribution.
We would like to support CentOS Stream, as well as RHEL and perhaps
one of the more popular CentOS replacements (e.g. Rocky Linux)
should they ever make a release. This work was necessary for that.
===Should I install CentOS 8?===
Probably not, given its short remaining lifespan, unless you plan
to switch it to CentOS Stream 8 or RHEL 8 later.
If you do, we'd like to know how you get on with our installer; it
has only received light testing so far.
CentOS 7 is still security supported by the CentOS Linux project
until 30 June 2024.
===What is CentOS Stream?===
I'm not going to try to explain Red Hat's full product lineup. As
far as I understand it, CentOS Stream is a rolling release, i.e.
constantly updated, containing packages that are about to go into
the corresponding RHEL release. Red Hat does not recommend it for
production use.
Red Hat's announcement is here:
https://www.redhat.com/en/blog/faq-centos-stream-updates
===Why are you considering offering RHEL?===
As of 1 February 2021 Red Hat is allowing its free Red Hat Developer
subscription to have up to 16 active servers:
https://www.redhat.com/en/blog/new-year-new-red-hat-enterprise-linux-progra…
It should be possible for us to support this soon, though with the
same caveat that it will likely be necessary to use the kernel-ml
package from ElRepo.
===Other questions?===
Please do ask if there's anything else.
Cheers,
Andy
¹ https://guv.cloud/
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At about 0000Z we started receiving alerts of packet loss and
began investigating. The problem was found to be internal, between hosts
"clockwork" and "limoncello" only. That is, everything on both hosts
was reachable from outside our network and also from inside as long
as it wasn't between those two hosts.
As there is a monitoring node on "limoncello", a number of alerts
were sent out about customer services on "clockwork" that it
considered down; those services were only unreachable from
"limoncello" (and vice versa), not actually down.
I tracked the issue to one of the two bonded switch ports for
"clockwork"; bringing that interface down and up again appears to
have cleared it. That happened at about 0045Z.
If the problem reoccurs we can down the interface and have it run on
one interface until the port or switch can be changed. If the
problem is actually in the NIC of the server itself things will be
more tricky, but we'll cross that bridge if we come to it.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At around 00:15Z we started receiving alerts regarding some servers
on host "elephant". Looking at the machine's console it was
reporting errors with its SAS controller, and was generally
unresponsive to anything requiring block IO, so I had no choice but
to power cycle it.
On boot I couldn't find any issue with its SAS controller, and it
was able to find all its storage devices and seemingly boot
normally. The last few customer VPSes have finished booting as I
type this.
I will keep an eye on things for the next few hours and let you know
about further actions. Please accept my apologies for the
disruption.
This is unrelated to the problems with "elephant" last month which
were tracked down to a kernel bug.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
A few days ago someone asked if we would match a 5% discount that
another hosting company offered to developers of significant open
source projects. After thinking about it for a bit I decided we
would.
So, if you are a registered Debian/Ubuntu Developer, Fedora
maintainer, BSD committer etc etc please feel free to email
support(a)bitfolk.com and ask if you qualify. When you do, please
provide some sort of link to your work so we can verify it.
More information here:
https://tools.bitfolk.com/wiki/Developer_discount
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
On Fri, Oct 23, 2020 at 11:46:11AM +0000, Andy Smith wrote:
> On Fri, Oct 23, 2020 at 11:19:21AM +0000, Andy Smith wrote:
> > I'm trying to isolate the issue to one particular VM because if a
> > guest can crash the host then it's a bug in the hypervisor and
> > just moving guests around won't solve the problem.
>
> I can't find it. As we have had problems with elephant before I'm
> going to assume hardware problem and start moving customer VMs to
> other hosts.
While moving customer VMs to other hosts, booting one of them caused
server "macallan" to crash in exactly the same way. So, I am ruling
out hardware issues with "elephant".
By preventing this particular VM from booting I was able to boot all
of the other VMs on "macallan". I have some hope that it is just
this one VM that is tickling a particularly nasty bug.
I am now going to try starting the remaining VMs on "elephant".
If that is successful I will then take the suspect VM to test
hardware to see if I can further reproduce.
I am confused because I am sure I tried reverting last weekend's
hypervisor upgrade to the previous version while investigating
matters on "elephant", yet it still crashed. Possibly I made a
mistake (e.g. booted with wrong hypervisor).
Also, everything obviously booted up fine last weekend when I did
the maintenance so possibly this customer has found a new and
unrelated bug.
The best case at this point is that I can reproduce the problem with
just that one VM, report it, get it fixed and then have to reboot
everything to deploy the fix.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
Server "elephant" unexpectedly crashed, then crashed twice more
shortly after rebooting, before all VPSes had finished starting. It
is now crashing every time it tries to boot VPSes. I suspected a
bug in the last round of XSA patches so reverted to the previous
hypervisor, but the problem persists. We had an issue with
"elephant" not so long ago, so it might be a hardware fault (though
there are no logs to back this up).
Still investigating, sorry.
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hello,
Unfortunately - and annoyingly only a month since the last lot -
some serious security bugs have been discovered in the Xen
hypervisor and fixes for these have now been pre-disclosed, with an
embargo that ends at 1200Z on 20 October 2020.
As a result we will need to apply these fixes and reboot everything
before that time. We are likely to do this in the early hours of the
morning UK time, on 17, 18 and 19 October.
In the next few days individual emails will be sent out confirming
to you which hour long maintenance window your services are in. The
times will be in UTC; please note that UK is currently observing
daylight savings and as such is currently at UTC+1. We expect the
work to take between 15 and 30 minutes per bare metal host.
If you have opted in to suspend and restore¹ then your VM will be
suspended to storage and restored again after the host it is on is
rebooted. Otherwise your VM will be cleanly shut down and booted
again later.
If you cannot tolerate the downtime then please contact
support(a)bitfolk.com. We may be able to migrate² you to
already-patched hardware before the regular maintenance starts. You
can expect a few tens of seconds of pausing in that case. This
process uses suspend&restore so has the same caveats.
It is disappointing to have another round of security reboots 28
days after the last lot, though before that there was a gap of about
330 days. Still, as there are security implications we have no
choice in the matter.
Cheers,
Andy
¹ https://tools.bitfolk.com/wiki/Suspend_and_restore
² https://tools.bitfolk.com/wiki/Suspend_and_restore#Migration
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
A reminder that maintenance is scheduled for the early hours (UK
time) of 17, 18 and 19 October.
Irritatingly, this may end up having to be postponed. One of the
patches has problems and the vendor is still working on that. If
they come up with something in the next few hours I will still have
time to test it appropriately, but if they don't then I won't and
we'll have to postpone this maintenance for one week.
Please assume it is going ahead unless you are notified otherwise.
You should have all received a direct email telling you the hour
long maintenance window that each of your VMs is in. If you can't
find it please check your spam folders etc; it was sent on 7
October.
If you still can't find it, work out which host machine you're on¹,
and then the maintenance windows are:
elephant 2020-10-17 00:00
hen 2020-10-18 02:00
hobgoblin 2020-10-18 01:00
jack 2020-10-19 00:00
leffe     2020-10-19 01:00
macallan 2020-10-17 02:00
paradox 2020-10-18 00:00
snaps 2020-10-19 02:00
talisker 2020-10-17 03:00
These times are all in UTC so add 1 hour for UK time (BST).
Cheers,
Andy
¹ This is listed on https://panel.bitfolk.com/ and is also evident
from resolving <accountname>.console.bitfolk.com in DNS, e.g.:
$ host ruminant.console.bitfolk.com
ruminant.console.bitfolk.com is an alias for console.leffe.bitfolk.com.
console.leffe.bitfolk.com is an alias for leffe.bitfolk.com.
leffe.bitfolk.com has address 85.119.80.22
leffe.bitfolk.com has IPv6 address 2001:ba8:0:1f1::d
----- Forwarded message from Andy Smith <andy(a)bitfolk.com> -----
Date: Wed, 7 Oct 2020 09:20:29 +0000
From: Andy Smith <andy(a)bitfolk.com>
To: announce(a)lists.bitfolk.com
Subject: [bitfolk] Reboots will be necessary to address security issues, probably early hours 17/18/19
October
User-Agent: Mutt/1.5.23 (2014-03-12)
Reply-To: users(a)lists.bitfolk.com
Hello,
Unfortunately - and annoyingly only a month since the last lot -
some serious security bugs have been discovered in the Xen
hypervisor and fixes for these have now been pre-disclosed, with an
embargo that ends at 1200Z on 20 October 2020.
As a result we will need to apply these fixes and reboot everything
before that time. We are likely to do this in the early hours of the
morning UK time, on 17, 18 and 19 October.
In the next few days individual emails will be sent out confirming
to you which hour long maintenance window your services are in. The
times will be in UTC; please note that UK is currently observing
daylight savings and as such is currently at UTC+1. We expect the
work to take between 15 and 30 minutes per bare metal host.
If you have opted in to suspend and restore¹ then your VM will be
suspended to storage and restored again after the host it is on is
rebooted. Otherwise your VM will be cleanly shut down and booted
again later.
If you cannot tolerate the downtime then please contact
support(a)bitfolk.com. We may be able to migrate² you to
already-patched hardware before the regular maintenance starts. You
can expect a few tens of seconds of pausing in that case. This
process uses suspend&restore so has the same caveats.
It is disappointing to have another round of security reboots 28
days after the last lot, though before that there was a gap of about
330 days. Still, as there are security implications we have no
choice in the matter.
Cheers,
Andy
¹ https://tools.bitfolk.com/wiki/Suspend_and_restore
² https://tools.bitfolk.com/wiki/Suspend_and_restore#Migration
--
https://bitfolk.com/ -- No-nonsense VPS hosting
----- End forwarded message -----
Hello,
Unfortunately some serious security bugs have been discovered in the
Xen hypervisor and fixes for these have now been pre-disclosed, with
an embargo that ends at 1200Z on 22 September 2020.
As a result we will need to apply these fixes and reboot everything
before that time. We are likely to do this in the early hours of the
morning UK time, on 19, 20 and 21 September.
In the next few days individual emails will be sent out confirming
to you which hour long maintenance window your services are in. The
times will be in UTC; please note that UK is currently observing
daylight savings and as such is currently at UTC+1. We expect the
work to take between 15 and 30 minutes per bare metal host.
If you have opted in to suspend and restore¹ then your VM will be
suspended to storage and restored again after the host it is on is
rebooted. Otherwise your VM will be cleanly shut down and booted
again later.
If you cannot tolerate the downtime then please contact
support(a)bitfolk.com. We may be able to almost-live migrate you to
already-patched hardware before the regular maintenance starts. You
can expect a few tens of seconds of pausing in that case. This is
still a somewhat experimental process and also requires you to opt
in to suspend and restore.
Cheers,
Andy
¹ https://tools.bitfolk.com/wiki/Suspend_and_restore
--
https://bitfolk.com/ -- No-nonsense VPS hosting