Hi,
At around 00:33 BST (23:33Z) we started to receive alerts regarding
services on host "clockwork". Upon investigation it was showing all
the signs of being the intermittent "frozen I/O" problem we've been
having:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
As mentioned in that earlier email, I'd decided that the next step
would be to prepare new hypervisor packages, and I did do that the
next day.
As this issue seems to happen only every few months and on different
servers, we do not yet know if the new packages fix the problem.
They've been in use on other servers since late April without
incident, but that isn't yet proof enough given the long periods
between occurrences.
Anyway, after "clockwork" was power cycled the new packages were
installed there and then all VMs were started again. This was
completed by about 01:16 BST (00:16Z).
There are still many of our servers where we know this is going to
happen again at some point. I don't feel comfortable scheduling
maintenance to upgrade them when I don't know if the upgrade will be
effective. If we can go a significant period of time on the newer
version without incident then we will schedule a maintenance window
to get the remaining servers on those versions too. It is also
possible that there will be a security patch that forces a
maintenance window, in which case we'll upgrade the hypervisor
packages to the newer version at the same time.
There are also some servers still left to be emptied so that their
operating systems can be upgraded. Those are "hen" and "paradox".
Once they are emptied and upgraded they will of course be put on the
newer version of the hypervisor. It is expected that customer
services we move from these servers will be put on servers that
already have the newer hypervisor version.
Thank you for your patience and I apologise for the disruption. I'm
doing all that I can to try to find a solution.
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
This is your one-week-to-go reminder that there will be scheduled
maintenance on four servers starting at 23:00 BST (22:00Z) on
Thursday 27 May 2021. The four servers affected will be:
- elephant.bitfolk.com
- limoncello.bitfolk.com
- snaps.bitfolk.com
- talisker.bitfolk.com
The maintenance window is 3 hours long but we expect the work to
take less than 30 minutes per server.
A direct email has also been sent out to contacts for all customers
on those servers.
If you cannot tolerate the ~30 minutes of downtime at that time
please reply to the email to open a support ticket asking for your
service to be moved about. That will take place at a time of your
choosing in the next week, but please ask early if you need it.
There are further details here:
https://tools.bitfolk.com/wiki/Maintenance/2021-05-Re-racking
Please note that we are still moving customers around due to rolling
software upgrades on our servers. That is unrelated to this work,
but right now customers are being moved off of snaps.bitfolk.com.
Possibly that server will be emptied before this work takes place.
The above wiki page tells you how to work out which server your
service is on.
Any other questions, please do let us know!
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
Tonight from around 21:00 BST onwards we started getting alerts and
support tickets regarding customer services on host
jack.bitfolk.com.
I had a look and unfortunately it appears to be a recurrence of
previous issues regarding stalled I/O:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
https://lists.bitfolk.com/lurker/message/20210220.032844.00dc9600.en.html
https://lists.bitfolk.com/lurker/message/20201116.003514.25278824.en.html
Things were largely unresponsive so I had to forcibly reboot the
server. Customer services were all booted or in the process of
booting by about 21:53.
What we know so far:
- It must be a software issue, not a hardware issue, as it's
happened on multiple servers of different specifications.
- It's only happening with servers that we've upgraded to Xen
version 4.12.x.
- It's going to be really difficult to track down because there can
be months between occurrences.
Each time this has occurred I've made some change that I'd hoped
would lead to a solution, but I've now tried all the easy things and
so all that remains is to do another software upgrade.
I think we're going to have to build some packages for Xen 4.14,
install them on a test host and see how it goes. The difficulty is
that once I do that and it seems to work, we still won't really
know, because we could just be in the long period of time where the
problem is not triggered. Clearly once we have seemingly-working
packages we can't leave them spinning for 6 months just to reassure
ourselves of that.
I am also unsure whether it is a good idea to force additional
downtime on you in order to upgrade servers to 4.14.x when I don't
even know yet if that will fix the issue. What I can do is have the
upgrade ready and then, if/when the issue recurs, do the upgrade at
that point so the server boots into the new version.
Anyway, all I can say is that this is a really unfortunate state of
affairs that obviously I'm not happy with and I'm doing all that I
can to resolve it. These outages are unacceptable and rest assured
they are aggravating me more than anyone else.
Thanks,
Andy Smith
BitFolk Ltd
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At approximately 17:28 BST we started receiving numerous alerts for
server "macallan" and customer services on it. Upon investigation I
was unable to connect to the IPMI console of the server.
I got in contact with the colo provider who quickly realised that
they were doing work in that rack and had knocked out the power
cable for this server.
The server started booting around 17:35 and all customer VMs had
booted by 17:47.
We use locking power cables in our servers to try to minimise this
sort of thing, but they only lock at one end - the server end. The
server's power cord had come loose at the other end.
"macallan" is one of our older servers which is single power supply
unit. To mitigate that risk it plugs into a automatic transfer
switch so that its single PSU continues to receive power even if one
of the rack's two PDUs or power feeds fails. Unfortunately that does
not protect it against its single power cord coming out of the ATS.
We have started a hardware refresh and the new spec servers do have
dual PSUs which should help to avoid things like this in future.
Please accept my apologies for this disruption.
Andy Smith
BitFolk Ltd
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
TL;DR: There are 21 serious security vulnerabilities recently
published for the Exim mail server, 10 of which are remotely
triggerable. Anyone running Exim needs to patch it ASAP or risk
having their server automatically root compromised as soon as an
exploit is cooked up. Which may have happened already.
Details: https://lwn.net/Articles/855282/
We don't usually post about other vendors' security issues on the
announce@ list but I'm making an exception for this one because Exim
is installed by default on all versions of Debian, and more than
60% of BitFolk customers use some version of Debian.
If you're running Exim you need to upgrade it immediately. Package
updates have already been posted for Debian 9 and 10
(stretch/oldstable and buster/stable). The last time this sort of
thing happened with Exim several customers were automatically
compromised. As it's a root-level compromise, if it happens to you
then you will never be sure exactly what was done to your server.
You might end up needing to reinstall it.
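For a stock Debian 9/10 system, upgrading is roughly the following
(a sketch only; package names can differ if you installed a
non-default Exim variant, and the exact fixed version numbers are
listed in the Debian security advisory):

    # as root (or prefix with sudo)
    apt-get update
    apt-get upgrade                  # pulls in the fixed exim4 packages
    exim4 -bV | head -n 1            # confirm the version now installed
    systemctl restart exim4          # make sure the patched daemon is running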
Most hosts, unless they are acting as a server listed in one or more
domains' MX records, do not need to be remotely accessible on port
25. If that's the case for you then you would be well advised to
reconfigure Exim to only listen on localhost. Though there are still
11 other vulnerabilities that local users could exploit. At least
you'd only get rooted by a friend, right?
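On Debian's packaged Exim the usual way to restrict it to the
loopback interface is via the exim4-config settings. A minimal
sketch, assuming you use the standard Debian configuration scheme
rather than a hand-maintained exim.conf:

    # in /etc/exim4/update-exim4.conf.conf set:
    #   dc_local_interfaces='127.0.0.1 ; ::1'
    # then regenerate the config and restart the daemon:
    update-exim4.conf
    systemctl restart exim4
    # (or run "dpkg-reconfigure exim4-config" and set the listen
    # addresses interactively)
    ss -ltn | grep ':25 '            # check port 25 is only bound to loopback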
An exploit hasn't been published yet but that doesn't mean that one
doesn't exist, and now that the source changes are public it should
be fairly easy for developers to work out how to do it.
Some of the bugs go back to 2004 so basically every Exim install is
at risk. If you are running a release of Debian prior to version 9
(stretch) then it's out of security support and may not ever see an
updated package for this, so you need to strongly consider turning
off any Exim server and doing an OS upgrade before you turn it back
on.
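If you do decide to take Exim out of service until you can do that
OS upgrade, on a systemd-based release that is roughly (again just a
sketch):

    systemctl stop exim4
    systemctl disable exim4
    # on pre-systemd releases something like:
    #   service exim4 stop && update-rc.d exim4 disable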
If you need help, you could reply to this and seek help from other
customers, or BitFolk can help you as a consultancy service, but you
probably don't want to pay consultancy prices, and in any moderately
complicated setup our approach is going to be an OS upgrade anyway.
Email support@bitfolk.com to discuss it if you're still interested.
Best of luck with the upgrading!
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
After receiving a number of alerts for VMs hosted on server "jack",
I investigated and found the server largely unresponsive.
Unfortunately I had no option but to forcibly reboot it, which I did
at about 06:47Z.
It's now 07:01Z and monitoring says everything is back up, except
for two customer VMs which are waiting for a LUKS passphrase on
their console.
This problem was the same as what was experienced with some of the
other servers a few months ago. With the months-long gap I had hoped
it was some undiagnosed kernel issue which we had got past, but
apparently not, as "jack" is on the latest available kernel package.
I'm pursuing some ideas about a config change that may help, and I
managed to put that into place before "jack" was rebooted. The
change requires a reboot, so even if it does help it won't take
effect on the other servers until their next reboot. On the other
hand it doesn't hurt either, so I've made the same change elsewhere
as well.
If that doesn't fix things then the next line of investigation will
be an upgrade of the hypervisor to the latest stable release, though
that is a rather major undertaking.
Apologies for the disruption. It is challenging to debug a problem
that can take several months to occur, with no reliable way of
triggering it. :(
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
We don't have a 100% firm date for this yet and I don't want to
announce it properly while the other moves are happening, but I also
want to try to give a month's notice: the servers that are not
being relocated today or on 27 April will very likely be relocated
on 27 May.
That is:
- elephant
- limoncello
- snaps
- talisker
Now because of the movement of individual customers that is still
taking place, it is very likely that everyone who's currently on
"snaps" will be moved off of it before then, in which case we will
relocate it at a time convenient to us as it won't affect anyone.
The other three have already had an OS upgrade so they definitely
will be relocated all together on the same night, which as I say is
very likely to be 27 May.
As before I will send another mail here when we have confirmation
and then we will send a direct email a week before to everyone who
will be affected. And as before it will be possible to have your
service moved about ahead of time at a time of your choosing.
While I'm here a reminder that:
- Server "hen" is being relocated tonight at some point between
21:00 and 23:00 BST (20:00 to 22:00 UTC). Those affected should
have received individual notifications about that a week ago.
- Servers "clockwork", "hobgoblin", "jack", "leffe", "macallan" and
"paradox" will be relocated on 27 April and later today all those
who will be affected will receive a direct email about this.
The full details of those works are at:
https://tools.bitfolk.com/wiki/Maintenance/2021-04-Re-racking
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hello all,
A web version of this email with any updates that have come to light
since posting is available here:
https://tools.bitfolk.com/wiki/Maintenance/2021-04-Re-racking
== TL;DR:
We need to relocate some servers to a different rack within
Telehouse.
On Tuesday 20 April 2021 at some point in the 2 hour window starting
at 20:00Z (21:00 BST) all customers on the following server will
have their VMs either powered off or suspended to storage:
* hen.bitfolk.com
We expect to have it powered back on within 30 minutes.
On Tuesday 27 April 2021 at some point in the 4 hour window starting
at 22:00Z (23:00 BST) all customers on the following servers will
have their VMs either powered off or suspended to storage:
* clockwork.bitfolk.com
* hobgoblin.bitfolk.com
* jack.bitfolk.com
* leffe.bitfolk.com
* macallan.bitfolk.com
* paradox.bitfolk.com
We expect the work on each server to take less than 30 minutes.
See "Frequently Asked Questions" at the bottom of this email for how
to determine which server your VM is on.
If you can't tolerate a ~30 minute outage at these times then please
contact support as soon as possible to ask for your VM to be moved
to a server that won't be part of this maintenance.
== Maintenance Background
Our colo provider needs to rebuild one of their racks that houses 7
of our servers. This is required because the infrastructure in the
rack (PDUs, switches, etc.) is of a ten-year-old vintage and all needs
replacing. To facilitate this, all customer hardware in that rack
will need to be moved to a different rack or sit outside of the rack
while it is rebuilt. We are going to have to move our 7 servers to a
different rack.
This is a significant piece of work which is going to affect several
hundred of our customers, more than 70% of the customer base.
Unfortunately it is unavoidable.
== Networking upgrade
We will also take the opportunity to install 10 gigabit NICs in the
servers which are moved. The main benefit of this will be faster
inter-server data transfer for when we want to move customer
services about. The current 1GE NICs limit this to about 90MiB/sec.
== Suspend & Restore
If you opt in to suspend & restore then instead of shutting your VM
down we will suspend it to storage and then when the server boots
again it will be restored. That means that you should not experience
a reboot, just a period of paused execution. You may find this less
disruptive than a reboot, but it is not without risk. Read more
here:
https://tools.bitfolk.com/wiki/Suspend_and_restore
== Avoiding the Maintenance
If you cannot tolerate a ~30 minute outage during the maintenance
windows listed above then please contact support to agree a time
when we can move your VM to a server that won't be part of the
maintenance.
Doing so will typically take just a few seconds plus the time it
takes your VM to shut down and boot again and nothing will change
about your VM.
If you have opted in to suspend & restore then we'll use this to do
a "semi-live" migration. This will appear to be a minute or two of
paused execution.
Moving your VM is extra work for us which is why we're not doing it
by default for all customers, but if you prefer that to experiencing
the outage then we're happy to do it at a time convenient to you, as
long as we have time to do it and available spare capacity to move
you to. If you need this then please ask as soon as possible to
avoid disappointment.
It won't be possible to change the date/time of the planned work on
an individual customer basis. This work involves 7 of our servers,
will affect several hundred of our customers, and also has needed to
be scheduled with our colo provider and some of their other
customers. The only per-customer thing we may be able to do is move
your service ahead of time at a time convenient to you.
== Rolling Upgrades Confusion
We're currently in a cycle of rolling software upgrades to our
servers. Many of you have already received individual support
tickets to schedule that. It involves us moving your VM from one of
our servers to another and full details are given in the support
ticket.
This has nothing to do with the maintenance that's under discussion
here and we realise that it's unfortunately very confusing to have
both things happening at the same time. We did not know that moving
our servers would be necessary when we started the rolling upgrades.
We believe we can avoid moving any customer from a server that is
not part of this maintenance onto one that will be part of this
maintenance. We cannot avoid moving customers between servers that
are both going to be affected by this maintenance. For example, at
the time of writing, customer services are being moved off of
jack.bitfolk.com and most of them will end up on
hobgoblin.bitfolk.com.
== Further Notifications
Every customer is supposed to be subscribed to this announcement
mailing list, but no doubt some aren't. The movement of customer
services between our servers may also be confusing for people, so we
will send a direct email notification to the main contact of
affected customers a week before the work is due to take place.
So, on Tuesday 13 April we'll send a direct email about this to
customers that are hosted on hen.bitfolk.com, and then on Tuesday 20
April we'll send a similar email to customers on all the rest of the
affected servers.
== 20 April Will Be a Test Run
We are only planning to move one server on 20 April. The reasons for this are:
* We want to check our assumptions about how long this work will
take, per server.
* We're changing the hardware configuration of the server by adding
10GE NICs, and we want to make sure that configuration is stable.
The timings for the maintenance on 27 April may need to be altered
if the work on 20 April shows our guesses to be wildly wrong.
== Frequently Asked Questions
=== How do I know if I will be affected?
If your VM is hosted on one of the servers that will be moved then
you are going to be affected. There are a few different ways that you
can tell which server you are on:
1. It's listed on https://panel.bitfolk.com/
2. It's in DNS when you resolve <youraccountname>.console.bitfolk.com (see the example after this list)
3. It's on your data transfer email summaries
4. You can see it on a `traceroute` or `mtr` to or from your VPS.
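As an illustration of option 2 ("myaccount" is a placeholder for
your own account name, and the exact form of the DNS answer may
vary):

    dig +short myaccount.console.bitfolk.com
    # the name/address returned should identify the host server,
    # i.e. one of the *.bitfolk.com servers listed above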
=== If you can "semi-live" migrate VMs, why don't you just do that?
* This maintenance will involve some 70% of our customer base, so we
don't actually have enough spare hardware to move customers to.
* Moving the data takes significant time at 1GE network speeds.
For these reasons we think that it will be easier for most customers
to just accept a ~30 minute outage. Those who can't tolerate such a
disruption will be able to have their VMs moved to servers that
aren't going to be part of the maintenance.
=== Why do you need to test out adding 10GE NICs to a live server? Isn't it already tested?
The main reason for running through this process first on one server
only (hen) is to check timings and procedure before doing it on
another six servers all at once. The issue of installing 10GE NICs
is a secondary concern and considered low risk.
The hardware for all 7 of the servers that are going to be moved is
now obsolete, so it's not possible to obtain identical spares.
The 10GE NICs have been tested in general, but not with this
specific hardware, so it's just an extra cautionary measure.
The 10GE NICs will not be in use immediately, in order to avoid too
much change at once, but this still involves plugging in a PCIe
card which will load a kernel module on boot, so while the risk is
considered low, it's not zero.
=== Further questions?
If there's anything we haven't covered or you need clarified please
do ask here or privately to support.
--
https://bitfolk.com/ -- No-nonsense VPS hosting