Hi,
At around 00:33 BST (23:33Z) we started to receive alerts regarding
services on host "clockwork". Upon investigation it was showing all
the signs of being the intermittent "frozen I/O" problem we've been
having:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
As mentioned in that earlier email, I'd decided that the next step
would be to prepare new hypervisor packages, and I did do that the
next day.
As this issue seems to happen only every few months and on different
servers, we do not yet know if the new packages fix the problem.
They've been in use on other servers since late April without
incident, but that isn't yet proof enough given the long periods
between occurrences.
Anyway, after "clockwork" was power cycled the new packages were
installed there and then all VMs were started again. This was
completed by about 01:16 BST (00:16Z).
There are still many of our servers where we know this is going to
happen again at some point. I don't feel comfortable scheduling
maintenance to upgrade them when I don't know if the upgrade will be
effective. If we can go a significant period of time on the newer
version without incident then we will schedule a maintenance window
to get the remaining servers on those versions too. It is also
possible that there will be a security patch that forces a
maintenance window, in which case we'll upgrade the hypervisor
packages to the newer version at the same time.
There are also some servers still left to be emptied so that their
operating systems can be upgraded. Those are "hen" and "paradox".
Once they are emptied and upgraded they will of course be put on the
newer version of the hypervisor. It is expected that customer
services we move from these servers will be put on servers that
already have the newer hypervisor version.
Thank you for your patience and I apologise for the disruption. I'm
doing all that I can to try to find a solution.
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
This is your one-week-to-go reminder that there will be scheduled
maintenance on four servers starting at 23:00 BST (22:00Z) on
Thursday 27 May 2021. The four servers affected will be:
- elephant.bitfolk.com
- limoncello.bitfolk.com
- snaps.bitfolk.com
- talisker.bitfolk.com
The maintenance window is 3 hours long but we expect the work to
take less than 30 minutes per server.
A direct email has also been sent out to contacts for all customers
on those servers.
If you cannot tolerate the ~30 minutes of downtime at that time
please reply to the email to open a support ticket asking for your
service to be moved about. That will take place at a time of your
choosing in the next week, but please ask early if you need it.
There are further details here:
https://tools.bitfolk.com/wiki/Maintenance/2021-05-Re-racking
Please note that we are still moving customers around due to rolling
software upgrades on our servers. That is unrelated to this work,
but right now customers are being moved off of snaps.bitfolk.com.
Possibly that server will be emptied before this work takes place.
The above wiki page tells you how to work out which server your
service is on.
Any other questions, please do let us know!
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
Tonight from around 21:00 BST onwards we started getting alerts and
support tickets regarding customer services on host
jack.bitfolk.com.
I had a look and unfortunately it appears to be a recurrence of
previous issues regarding stalled I/O:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
https://lists.bitfolk.com/lurker/message/20210220.032844.00dc9600.en.html
https://lists.bitfolk.com/lurker/message/20201116.003514.25278824.en.html
Things were largely unresponsive so I had to forcibly reboot the
server. Customer services were all booted or in the process of
booting by about 21:53.
What we know so far:
- It must be a software issue, not a hardware issue, as it's
happened on multiple servers of different specifications.
- It's only happening with servers that we've upgraded to Xen
version 4.12.x.
- It's going to be really difficult to track down because there can
be months between occurrences.
Each time this has occurred I've made some change that I'd hoped
would lead to a solution, but I've now tried all the easy things and
so all that remains is to do another software upgrade.
I think we're going to have to build some packages for Xen 4.14,
install them on a test host and see how it goes. The difficulty is
that once I do that and it seems to work, we still won't really
know, because we could just be in the long period of time where the
problem is not triggered. Clearly once we have seemingly-working
packages we can't leave them spinning for 6 months just to reassure
ourselves of that.
I am also unsure whether it is a good idea to force additional
downtime on you in order to upgrade servers to 4.14.x when I don't
even know yet if that will fix the issue. What I can do is have the
upgrade ready and then, if/when the issue recurs, do the upgrade at
that point so the server boots into the new version.
Anyway, all I can say is that this is a really unfortunate state of
affairs that obviously I'm not happy with and I'm doing all that I
can to resolve it. These outages are unacceptable and rest assured
they are aggravating me more than anyone else.
Thanks,
Andy Smith
BitFolk Ltd
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At approximately 17:28 BST we started receiving numerous alerts for
server "macallan" and customer services on it. Upon investigation I
was unable to connect to the IPMI console of the server.
I got in contact with the colo provider who quickly realised that
they were doing work in that rack and had knocked out the power
cable for this server.
The server started booting around 17:35 and all customer VMs had
booted by 17:47.
We use locking power cables in our servers to try to minimise this
sort of thing, but they only lock at one end - the server end. The
server's power cord had come loose at the other end.
"macallan" is one of our older servers which is single power supply
unit. To mitigate that risk it plugs into a automatic transfer
switch so that its single PSU continues to receive power even if one
of the rack's two PDUs or power feeds fails. Unfortunately that does
not protect it against its single power cord coming out of the ATS.
We have started a hardware refresh and the new spec servers do have
dual PSUs which should help to avoid things like this in future.
Please accept my apologies for this disruption.
Andy Smith
BitFolk Ltd
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
TL;DR: There are 21 serious security vulnerabilities recently
published for the Exim mail server, 10 of which are remotely
triggerable. Anyone running Exim needs to patch it ASAP or risk
having their server automatically root compromised as soon as an
exploit is cooked up. Which may have happened already.
Details: https://lwn.net/Articles/855282/
We don't usually post about other vendors' security issues on the
announce@ list but I'm making an exception for this one because Exim
is installed by default on all versions of Debian, and more than
60% of BitFolk customers use some version of Debian.
If you're running Exim you need to upgrade it immediately. Package
updates have already been posted for Debian 9 and 10
(stretch/oldstable and buster/stable). The last time this sort of
thing happened with Exim several customers were automatically
compromised. As it's a root-level compromise, if it happens to you
then you will never be sure exactly what was done to your server.
You might end up needing to reinstall it.
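For a stock Debian 9/10 system, upgrading is roughly the following
(a sketch only; package names can differ if you installed a
non-default Exim variant, and the exact fixed version numbers are
listed in the Debian security advisory):

    # as root (or prefix with sudo)
    apt-get update
    apt-get upgrade                  # pulls in the fixed exim4 packages
    exim4 -bV | head -n 1            # confirm the version now installed
    systemctl restart exim4          # make sure the patched daemon is running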
Most hosts, unless they are acting as a server listed in one or more
domains' MX records, do not need to be remotely accessible on port
25. If that's the case for you then you would be well advised to
reconfigure Exim to only listen on localhost. Though there are still
11 other vulnerabilities that local users could exploit. At least
you'd only get rooted by a friend, right?
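On Debian's packaged Exim the usual way to restrict it to the
loopback interface is via the exim4-config settings. A minimal
sketch, assuming you use the standard Debian configuration scheme
rather than a hand-maintained exim.conf:

    # in /etc/exim4/update-exim4.conf.conf set:
    #   dc_local_interfaces='127.0.0.1 ; ::1'
    # then regenerate the config and restart the daemon:
    update-exim4.conf
    systemctl restart exim4
    # (or run "dpkg-reconfigure exim4-config" and set the listen
    # addresses interactively)
    ss -ltn | grep ':25 '            # check port 25 is only bound to loopback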
An exploit hasn't been published yet but that doesn't mean that one
doesn't exist, and now that the source changes are public it should
be fairly easy for developers to work out how to do it.
Some of the bugs go back to 2004 so basically every Exim install is
at risk. If you are running a release of Debian prior to version 9
(stretch) then it's out of security support and may not ever see an
updated package for this, so you need to strongly consider turning
off any Exim server and doing an OS upgrade before you turn it back
on.
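If you do decide to take Exim out of service until you can do that
OS upgrade, on a systemd-based release that is roughly (again just a
sketch):

    systemctl stop exim4
    systemctl disable exim4
    # on pre-systemd releases something like:
    #   service exim4 stop && update-rc.d exim4 disable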
If you need help, you could reply to this and seek help from other
customers, or BitFolk can help you as a consultancy service, but you
probably don't want to pay consultancy prices, and in any moderately
complicated setup our approach is going to be an OS upgrade anyway.
Email support@bitfolk.com to discuss it if you're still interested.
Best of luck with the upgrading!
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
After receiving a number of alerts for VMs hosted on server "jack",
I investigated and found the server largely unresponsive.
Unfortunately I had no option but to forcibly reboot it, which I did
at about 06:47Z.
It's now 07:01Z and monitoring says everything is back up, except
for two customer VMs which are waiting for a LUKS passphrase on
their console.
This problem was the same as what was experienced with some of the
other servers a few months ago. With the months-long gap I had hoped
it was some undiagnosed kernel issue which we had got past, but
apparently not, as "jack" is on the latest available kernel package.
I'm pursuing some ideas about a config change that may help, and I
managed to put that into place before "jack" was rebooted. The
change requires a reboot, so even if it does help it won't take
effect on the other servers until their next reboot. On the other
hand it doesn't hurt either, so I've made the same change elsewhere
as well.
If that doesn't fix things then the next line of investigation will
be an upgrade of the hypervisor to the latest stable release, though
that is a rather major undertaking.
Apologies for the disruption. It is challenging to debug a problem
that can take several months to occur, with no reliable way of
triggering it. :(
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
We don't have a 100% firm date for this yet and I don't want to
announce it properly while the other moves are happening, but I also
want to try to give a month's notice: the servers that are not
being relocated today or on 27 April will very likely be relocated
on 27 May.
That is:
- elephant
- limoncello
- snaps
- talisker
Now because of the movement of individual customers that is still
taking place, it is very likely that everyone who's currently on
"snaps" will be moved off of it before then, in which case we will
relocate it at a time convenient to us as it won't affect anyone.
The other three have already had an OS upgrade so they definitely
will be relocated all together on the same night, which as I say is
very likely to be 27 May.
As before I will send another mail here when we have confirmation
and then we will send a direct email a week before to everyone who
will be affected. And as before it will be possible to have your
service moved about ahead of time at a time of your choosing.
While I'm here a reminder that:
- Server "hen" is being relocated tonight at some point between
21:00 and 23:00 BST (20:00 to 22:00 UTC). Those affected should
have received individual notifications about that a week ago.
- Servers "clockwork", "hobgoblin", "jack", "leffe", "macallan" and
"paradox" will be relocated on 27 April and later today all those
who will be affected will receive a direct email about this.
The full details of those works are at:
https://tools.bitfolk.com/wiki/Maintenance/2021-04-Re-racking
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hello all,
A web version of this email with any updates that have come to light
since posting is available here:
https://tools.bitfolk.com/wiki/Maintenance/2021-04-Re-racking
== TL;DR:
We need to relocate some servers to a different rack within
Telehouse.
On Tuesday 20 April 2021 at some point in the 2 hour window starting
at 20:00Z (21:00 BST) all customers on the following server will
have their VMs either powered off or suspended to storage:
* hen.bitfolk.com
We expect to have it powered back on within 30 minutes.
On Tuesday 27 April 2021 at some point in the 4 hour window starting
at 22:00Z (23:00 BST) all customers on the following servers will
have their VMs either powered off or suspended to storage:
* clockwork.bitfolk.com
* hobgoblin.bitfolk.com
* jack.bitfolk.com
* leffe.bitfolk.com
* macallan.bitfolk.com
* paradox.bitfolk.com
We expect the work on each server to take less than 30 minutes.
See "Frequently Asked Questions" at the bottom of this email for how
to determine which server your VM is on.
If you can't tolerate a ~30 minute outage at these times then please
contact support as soon as possible to ask for your VM to be moved
to a server that won't be part of this maintenance.
== Maintenance Background
Our colo provider needs to rebuild one of their racks that houses 7
of our servers. This is required because the infrastructure in the
rack (PDUs, switches, etc.) is of a ten-year-old vintage and all needs
replacing. To facilitate this, all customer hardware in that rack
will need to be moved to a different rack or sit outside of the rack
while it is rebuilt. We are going to have to move our 7 servers to a
different rack.
This is a significant piece of work which is going to affect several
hundred of our customers, more than 70% of the customer base.
Unfortunately it is unavoidable.
== Networking upgrade
We will also take the opportunity to install 10 gigabit NICs in the
servers which are moved. The main benefit of this will be faster
inter-server data transfer for when we want to move customer
services about. The current 1GE NICs limit this to about 90MiB/sec.
== Suspend & Restore
If you opt in to suspend & restore then instead of shutting your VM
down we will suspend it to storage and then when the server boots
again it will be restored. That means that you should not experience
a reboot, just a period of paused execution. You may find this less
disruptive than a reboot, but it is not without risk. Read more
here:
https://tools.bitfolk.com/wiki/Suspend_and_restore
== Avoiding the Maintenance
If you cannot tolerate a ~30 minute outage during the maintenance
windows listed above then please contact support to agree a time
when we can move your VM to a server that won't be part of the
maintenance.
Doing so will typically take just a few seconds plus the time it
takes your VM to shut down and boot again and nothing will change
about your VM.
If you have opted in to suspend & restore then we'll use this to do
a "semi-live" migration. This will appear to be a minute or two of
paused execution.
Moving your VM is extra work for us which is why we're not doing it
by default for all customers, but if you prefer that to experiencing
the outage then we're happy to do it at a time convenient to you, as
long as we have time to do it and available spare capacity to move
you to. If you need this then please ask as soon as possible to
avoid disappointment.
It won't be possible to change the date/time of the planned work on
an individual customer basis. This work involves 7 of our servers,
will affect several hundred of our customers, and also has needed to
be scheduled with our colo provider and some of their other
customers. The only per-customer thing we may be able to do is move
your service ahead of time at a time convenient to you.
== Rolling Upgrades Confusion
We're currently in a cycle of rolling software upgrades to our
servers. Many of you have already received individual support
tickets to schedule that. It involves us moving your VM from one of
our servers to another and full details are given in the support
ticket.
This has nothing to do with the maintenance that's under discussion
here and we realise that it's unfortunately very confusing to have
both things happening at the same time. We did not know that moving
our servers would be necessary when we started the rolling upgrades.
We believe we can avoid moving any customer from a server that is
not part of this maintenance onto one that will be part of this
maintenance. We cannot avoid moving customers between servers that
are both going to be affected by this maintenance. For example, at
the time of writing, customer services are being moved off of
jack.bitfolk.com and most of them will end up on
hobgoblin.bitfolk.com.
== Further Notifications
Every customer is supposed to be subscribed to this announcement
mailing list, but no doubt some aren't. The movement of customer
services between our servers may also be confusing for people, so we
will send a direct email notification to the main contact of
affected customers a week before the work is due to take place.
So, on Tuesday 13 April we'll send a direct email about this to
customers that are hosted on hen.bitfolk.com, and then on Tuesday 20
April we'll send a similar email to customers on all the rest of the
affected servers.
== 20 April Will Be a Test Run
We are only planning to move one server on 20 April. The reasons for this are:
* We want to check our assumptions about how long this work will
take, per server.
* We're changing the hardware configuration of the server by adding
10GE NICs, and we want to make sure that configuration is stable.
The timings for the maintenance on 27 April may need to be altered
if the work on 20 April shows our guesses to be wildly wrong.
== Frequently Asked Questions
=== How do I know if I will be affected?
If your VM is hosted on one of the servers that will be moved then
you are going to be affected. There are a few different ways that you
can tell which server you are on:
1. It's listed on https://panel.bitfolk.com/
2. It's in DNS when you resolve <youraccountname>.console.bitfolk.com (see the example after this list)
3. It's on your data transfer email summaries
4. You can see it on a `traceroute` or `mtr` to or from your VPS.
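As an illustration of option 2 ("myaccount" is a placeholder for
your own account name, and the exact form of the DNS answer may
vary):

    dig +short myaccount.console.bitfolk.com
    # the name/address returned should identify the host server,
    # i.e. one of the *.bitfolk.com servers listed above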
=== If you can "semi-live" migrate VMs, why don't you just do that?
* This maintenance will involve some 70% of our customer base, so we
don't actually have enough spare hardware to move customers to.
* Moving the data takes significant time at 1GE network speeds.
For these reasons we think that it will be easier for most customers
to just accept a ~30 minute outage. Those who can't tolerate such a
disruption will be able to have their VMs moved to servers that
aren't going to be part of the maintenance.
=== Why do you need to test out adding 10GE NICs to a live server? Isn't it already tested?
The main reason for running through this process first on one server
only (hen) is to check timings and procedure before doing it on
another six servers all at once. The issue of installing 10GE NICs
is a secondary concern and considered low risk.
The hardware for all 7 of the servers that are going to be moved is
now obsolete, so it's not possible to obtain identical spares.
The 10GE NICs have been tested in general, but not with this
specific hardware, so it's just an extra cautionary measure.
The 10GE NICs will not be in use immediately, in order to avoid too
much change at once, but this still involves plugging in a PCIe
card which will load a kernel module on boot, so while the risk is
considered low, it's not zero.
=== Further questions?
If there's anything we haven't covered or you need clarified please
do ask here or privately to support.
--
https://bitfolk.com/ -- No-nonsense VPS hosting