Hi,
At about 19:30Z we started receiving alerts for customer services on
server "limoncello".
On investigation it quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.
All I could do was power cycle the server.
My current line of investigation is to upgrade both the hypervisor
and the kernel when this happens, and so far the problem hasn't
recurred on any of the servers where that has been done. However,
the sometimes months-long gap between incidents means it's not
possible to be sure.
Although this last happened 16 days ago, that was on a different
server ("jack").
With the upgrades done, the server was rebooted again and at about
20:28Z customer VMs started booting again. This was complete by
about 20:45Z.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At about 05:23Z we started receiving alerts for customer services on
server "jack". There had been some alerts for about 40 minutes
before that, but they weren't serious enough to send push
notifications, only emails.
On investigation it quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.
All I could do was power cycle the server, which happened at about
05:30Z.
My current line of investigation is to upgrade both the hypervisor
and the kernel when this happens, and so far the problem hasn't
recurred on any of the servers where that has been done. However,
the sometimes months-long gap between incidents means it's not
possible to be sure.
With the upgrades done, the server was rebooted again and at about
05:54Z customer VMs started booting again. This was complete by
about 06:08Z.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At about 20:37Z we started receiving alerts for customer services on
server "hobgoblin". It quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.
All I could do was power cycle the server, which happened at about
20:58Z.
We're still not able to reproduce the problem on demand and it can
be several months between incidents. We've tried upgrading the
hypervisor and that hasn't helped. It's looking more like a problem
in the Linux kernel, so I upgraded that as well, to a newer
self-built package.
I've been communicating with a couple of the linux-raid developers
and we have some ideas, but gathering information and making
changes is going slowly because of the lack of reproducibility and
the long time between incidents. It's basically a case of making a
single change each time there is an incident.
With the upgrades done, the server was rebooted again and at about
21:19Z customer VMs started booting again. This was complete by
about 21:33Z.
Obviously I am not happy with these outages and I'm doing everything
I can to find the root cause.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At around 00:33 BST (23:33Z) we started to receive alerts regarding
services on host "clockwork". Upon investigation it was showing all
the signs of being the intermittent "frozen I/O" problem we've been
having:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
As mentioned in that earlier email, I'd decided that the next step
would be to prepare new hypervisor packages, and I did that the
next day.
As this issue seems to happen only every few months and on different
servers we do not yet know if the new packages fix the problem.
They've been in use on other servers since late April without
incident, but that isn't yet proof enough given the long periods
between occurrences.
Anyway, after "clockwork" was power cycled the new packages were
installed there and then all VMs were started again. This was
completed by about 01:16 BST (00:16Z).
There are still many of our servers where we know this is going to
happen again at some point. I don't feel comfortable scheduling
maintenance to upgrade them when I don't know if the upgrade will be
effective. If we can go a significant period of time on the newer
version without incident then we will schedule a maintenance window
to get the remaining servers on those versions too. It is also
possible that a security patch will force a maintenance window, in
which case we'll upgrade the hypervisor packages to the newer
version at the same time.
There are also some servers still left to be emptied so that their
operating systems can be upgraded. Those are "hen" and "paradox".
Once they are emptied and upgraded they will of course be put on the
newer version of the hypervisor. It is expected that customer
services we move from these servers will be put on servers that
already have the newer hypervisor version.
Thank you for your patience and I apologise for the disruption. I'm
doing all that I can to try to find a solution.
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
There was some confusion recently about whether the email
notification about running an open portmapper required any action.
The notification email therefore needs to be improved. Some
suggestions are here:
https://tools.bitfolk.com/redmine/issues/197
If you have any comments, you can log in with your usual BitFolk
credentials to add them.
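If you'd like to check whether your own VPS has a portmapper
exposed, a quick check on a modern Linux (assuming iproute2 and
systemd are present) looks like this:

    # Is anything listening on the portmapper port (111)?
    sudo ss -tulnp | grep ':111 '

    # If rpcbind is running and you don't use NFS/NIS, you can
    # disable it entirely:
    sudo systemctl disable --now rpcbind.socket rpcbind.service

If port 111 is only bound to 127.0.0.1 then nothing is exposed and
no action is needed.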
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
This is your one-week-to-go reminder that there will be scheduled
maintenance on four servers starting at 23:00 BST (22:00Z) on
Thursday 27 May 2021. The four servers affected will be:
- elephant.bitfolk.com
- limoncello.bitfolk.com
- snaps.bitfolk.com
- talisker.bitfolk.com
The maintenance window is 3 hours long but we expect the work to
take less than 30 minutes per server.
A direct email has also been sent out to contacts for all customers
on those servers.
If you cannot tolerate the ~30 minutes of downtime at that time,
please reply to the email to open a support ticket asking for your
service to be moved in advance. The move will take place at a time
of your choosing within the next week, but please ask early if you
need it.
There are further details here:
https://tools.bitfolk.com/wiki/Maintenance/2021-05-Re-racking
Please note that we are still moving customers around due to
rolling software upgrades on our servers. That is unrelated to this
work, but right now customers are being moved off
snaps.bitfolk.com, so that server may be emptied before this work
takes place.
The above wiki page tells you how to work out which server your
service is on.
Any other questions, please do let us know!
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
TL;DR: There are 21 serious security vulnerabilities recently
published for the Exim mail server, 10 of which are remotely
triggerable. Anyone running Exim needs to patch it ASAP or risk
having their server automatically root-compromised as soon as an
exploit is cooked up, which may have happened already.
Details: https://lwn.net/Articles/855282/
We don't usually post about other vendors' security issues on the
announce@ list but I'm making an exception for this one because Exim
is installed by default on all versions of Debian, and more than
60% of BitFolk customers use some version of Debian.
If you're running Exim you need to upgrade it immediately. Package
updates have already been posted for Debian 9 and 10
(stretch/oldstable and buster/stable). The last time this sort of
thing happened with Exim, several customers were automatically
compromised. As it's a root-level compromise, if it happens to you
then you will never be sure exactly what was done to your server.
You might end up needing to reinstall it.
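On Debian the fix should just be a matter of installing the
updated packages. A minimal sketch, assuming the default
exim4-daemon-light packaging (adjust if you run
exim4-daemon-heavy):

    # Fetch new package lists and pull in the fixed Exim packages
    sudo apt-get update
    sudo apt-get install --only-upgrade \
        exim4-base exim4-config exim4-daemon-light

    # Confirm the version now installed
    exim4 -bV | head -1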
Most hosts, unless they are acting as a server listed in one or
more domains' MX records, do not need to be remotely accessible on
port 25. If that's the case for you then you would be well advised
to reconfigure Exim to listen only on localhost, though there are
still 11 other vulnerabilities that local users could exploit. At
least you'd only get rooted by a friend, right?
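On Debian's packaged configuration there is a setting for exactly
this. A sketch, assuming you are using the stock update-exim4.conf
machinery rather than a hand-maintained config:

    # In /etc/exim4/update-exim4.conf.conf set:
    dc_local_interfaces='127.0.0.1 ; ::1'

    # Then regenerate Exim's config and restart the daemon:
    sudo update-exim4.conf
    sudo systemctl restart exim4

The same effect can be had via "dpkg-reconfigure exim4-config" by
picking a non-internet-site configuration type and accepting the
default listening addresses.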
An exploit hasn't been published yet but that doesn't mean that one
doesn't exist, and now that the source changes are public it should
be fairly easy for developers to work out how to do it.
Some of the bugs go back to 2004, so basically every Exim install
is at risk. If you are running a release of Debian prior to version
9 (stretch) then it's out of security support and may never see an
updated package for this, so you should strongly consider turning
off any Exim server and doing an OS upgrade before you turn it back
on.
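Turning it off in the meantime is quick. On a systemd-based
release:

    sudo systemctl disable --now exim4

and on older sysvinit-based releases:

    sudo service exim4 stop
    sudo update-rc.d exim4 disable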
If you need help, you could reply to this and seek help from other
customers, or BitFolk can help you as a consultancy service, but
you probably don't want to pay consultancy prices, and for any
moderately complicated setup our approach is going to be an OS
upgrade anyway. Email support@bitfolk.com if you're still
interested in that.
Best of luck with the upgrading!
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
Tonight from around 21:00 BST onwards we started getting alerts and
support tickets regarding customer services on host
jack.bitfolk.com.
I had a look and unfortunately it appears to be a recurrence of
the previous issues regarding stalled I/O:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
https://lists.bitfolk.com/lurker/message/20210220.032844.00dc9600.en.html
https://lists.bitfolk.com/lurker/message/20201116.003514.25278824.en.html
Things were largely unresponsive so I had to forcibly reboot the
server. Customer services were all booted or in the process of
booting by about 21:53.
What we know so far:
- It must be a software issue, not a hardware issue, as it's
happened on multiple servers of different specifications.
- It's only happening with servers that we've upgraded to Xen
version 4.12.x.
- It's going to be really difficult to track down because there can
be months between occurrences.
Each time this has occurred I've made some change that I'd hoped
would lead to a solution, but I've now tried all the easy things and
so all that remains is to do another software upgrade.
I think we're going to have to build some packages for Xen 4.14,
install them on a test host and see how it goes. The difficulty is
that once I do that and it seems to work, we'll never really know
because it could just be in the long period of time where the
problem is not triggered. Clearly once we have seemingly-working
packages we can't leave them spinning for 6 months just to reassure
ourselves of that.
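For reference, the version of Xen a host is actually running can
be checked from its control domain with:

    # xen_version plus xen_extra give the full running version
    xl info | grep -E '^xen_(version|extra)'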
I'm also unsure whether it is a good idea to force additional
downtime on you in order to upgrade servers to 4.14.x when I don't
even know yet if that will fix the issue. What I can do is have the
upgrade ready and then, if/when the issue recurs, do the upgrade at
that point so that the server boots into the new version.
Anyway, all I can say is that this is a really unfortunate state of
affairs that obviously I'm not happy with and I'm doing all that I
can to resolve it. These outages are unacceptable and rest assured
they are aggravating me more than anyone else.
Thanks,
Andy Smith
BitFolk Ltd
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
I'm sorry. There are problems with the host, and VMs on "jack"
will be unresponsive or shutting down. I'm working on it.
Cheers,
Andy
Hi,
At approximately 17:28 BST we started receiving numerous alerts for
server "macallan" and customer services on it. Upon investigation I
was unable to connect to the IPMI console of the server.
I got in contact with the colo provider who quickly realised that
they were doing work in that rack and had knocked out the power
cable for this server.
The server started booting around 17:35 and all customer VMs had
booted by 17:47.
We use locking power cables in our servers to try to minimise this
sort of thing, but they only lock at one end - the server end. The
server's power cord had come loose at the other end.
"macallan" is one of our older servers which is single power supply
unit. To mitigate that risk it plugs into a automatic transfer
switch so that its single PSU continues to receive power even if one
of the rack's two PDUs or power feeds fails. Unfortunately that does
not protect it against its single power cord coming out of the ATS.
We have started a hardware refresh and the new spec servers do have
dual PSUs which should help to avoid things like this in future.
Please accept my apologies for this disruption.
Andy Smith
BitFolk Ltd
--
https://bitfolk.com/ -- No-nonsense VPS hosting