Hi,
At about 19:30Z we started receiving alerts for customer services on
server "limoncello".
On investigation it quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.
All I could do was power cycle the server.
My current line of investigation is to upgrade both the hypervisor
and the kernel when this happens, and so far the problem hasn't
recurred on any of the servers where that has been done. However,
the sometimes months-long gap between incidents means it's not
possible to be sure.
Although this last happened 16 days ago, that was on a different
server ("jack").
With the upgrades done, the server was rebooted again and at about
20:28Z customer VMs started booting again. This was complete by
about 20:45Z.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At about 05:23Z we started receiving alerts for customer services on
server "jack". There had been some alerts for about 40 minutes
before that, but they weren't serious enough to send push
notifications, only emails.
On investigation it quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.
All I could do was power cycle the server, which happened at about
05:30Z.
My current line of investigation is to upgrade both the hypervisor
and the kernel when this happens, and so far the problem hasn't
recurred on any of the servers where that has been done. However,
the sometimes months-long gap between incidents means it's not
possible to be sure.
With the upgrades done, the server was rebooted again and at about
05:54Z customer VMs started booting again. This was complete by
about 06:08Z.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At about 20:37Z we started receiving alerts for customer services on
server "hobgoblin". It quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.
All I could do was power cycle the server, which happened at about
20:58Z.
We're still not able to reproduce the problem on demand and it can
be several months between incidents. We've tried upgrading the
hypervisor and that hasn't helped. It's looking more like a problem
in the Linux kernel, so I upgraded that as well, to a newer
self-built package.
I've been communicating with a couple of the linux-raid developers
and we have some ideas, but gathering information and making
changes is going slowly because of the lack of reproducibility and
the long time between incidents. It's basically a case of making a
single change each time there is an incident.
With the upgrades done, the server was rebooted again and at about
21:19Z customer VMs started booting again. This was complete by
about 21:33Z.
Obviously I am not happy with these outages and I'm doing everything
I can to find the root cause.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At around 00:33 BST (23:33Z) we started to receive alerts regarding
services on host "clockwork". Upon investigation it was showing all
the signs of being the intermittent "frozen I/O" problem we've been
having:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
As mentioned in that earlier email, I'd decided that the next step
would be to prepare new hypervisor packages, and I did that the
next day.
As this issue seems to happen only every few months and on different
servers we do not yet know if the new packages fix the problem.
They've been in use on other servers since late April without
incident, but that isn't yet proof enough given the long periods
between occurrences.
Anyway, after "clockwork" was power cycled the new packages were
installed there and then all VMs were started again. This was
completed by about 01:16 BST (00:16Z).
There are still many of our servers where we know this is going to
happen again at some point. I don't feel comfortable scheduling
maintenance to upgrade them when I don't know if the upgrade will be
effective. If we can go a significant period of time on the newer
version without incident then we will schedule a maintenance window
to get the remaining servers on those versions too. It is also
possible that a security patch will force a maintenance window, in
which case we'll upgrade the hypervisor packages to the newer
version at the same time.
There are also some servers still left to be emptied so that their
operating systems can be upgraded. Those are "hen" and "paradox".
Once they are emptied and upgraded they will of course be put on the
newer version of the hypervisor. It is expected that customer
services we move from these servers will be put on servers that
already have the newer hypervisor version.
Thank you for your patience and I apologise for the disruption. I'm
doing all that I can to try to find a solution.
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
There was some confusion recently about whether the email
notification about running an open portmapper required any action.
The notification email therefore needs to be improved. Some
suggestions are here:
https://tools.bitfolk.com/redmine/issues/197
If you have any comments, you can log in with your usual BitFolk
credentials to add them.
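If you'd like to check whether your own VPS has a portmapper
exposed, a quick check on a modern Linux (assuming iproute2 and
systemd are present) looks like this:

    # Is anything listening on the portmapper port (111)?
    sudo ss -tulnp | grep ':111 '

    # If rpcbind is running and you don't use NFS/NIS, you can
    # disable it entirely:
    sudo systemctl disable --now rpcbind.socket rpcbind.service

If port 111 is only bound to 127.0.0.1 then nothing is exposed and
no action is needed.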
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
This is your one-week-to-go reminder that there will be scheduled
maintenance on four servers starting at 23:00 BST (22:00Z) on
Thursday 27 May 2021. The four servers affected will be:
- elephant.bitfolk.com
- limoncello.bitfolk.com
- snaps.bitfolk.com
- talisker.bitfolk.com
The maintenance window is 3 hours long but we expect the work to
take less than 30 minutes per server.
A direct email has also been sent out to contacts for all customers
on those servers.
If you cannot tolerate the ~30 minutes of downtime at that time,
please reply to the email to open a support ticket asking for your
service to be moved in advance. The move will take place at a time
of your choosing within the next week, but please ask early if you
need it.
There are further details here:
https://tools.bitfolk.com/wiki/Maintenance/2021-05-Re-racking
Please note that we are still moving customers around due to
rolling software upgrades on our servers. That is unrelated to this
work, but right now customers are being moved off
snaps.bitfolk.com, so that server may be emptied before this work
takes place.
The above wiki page tells you how to work out which server your
service is on.
Any other questions, please do let us know!
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
TL;DR: There are 21 serious security vulnerabilities recently
published for the Exim mail server, 10 of which are remotely
triggerable. Anyone running Exim needs to patch it ASAP or risk
having their server automatically root-compromised as soon as an
exploit is cooked up, which may have happened already.
Details: https://lwn.net/Articles/855282/
We don't usually post about other vendors' security issues on the
announce@ list but I'm making an exception for this one because Exim
is installed by default on all versions of Debian, and more than
60% of BitFolk customers use some version of Debian.
If you're running Exim you need to upgrade it immediately. Package
updates have already been posted for Debian 9 and 10
(stretch/oldstable and buster/stable). The last time this sort of
thing happened with Exim, several customers were automatically
compromised. As it's a root-level compromise, if it happens to you
then you will never be sure exactly what was done to your server.
You might end up needing to reinstall it.
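On Debian the fix should just be a matter of installing the
updated packages. A minimal sketch, assuming the default
exim4-daemon-light packaging (adjust if you run
exim4-daemon-heavy):

    # Fetch new package lists and pull in the fixed Exim packages
    sudo apt-get update
    sudo apt-get install --only-upgrade \
        exim4-base exim4-config exim4-daemon-light

    # Confirm the version now installed
    exim4 -bV | head -1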
Most hosts, unless they are acting as a server listed in one or
more domains' MX records, do not need to be remotely accessible on
port 25. If that's the case for you then you would be well advised
to reconfigure Exim to listen only on localhost, though there are
still 11 other vulnerabilities that local users could exploit. At
least you'd only get rooted by a friend, right?
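On Debian's packaged configuration there is a setting for exactly
this. A sketch, assuming you are using the stock update-exim4.conf
machinery rather than a hand-maintained config:

    # In /etc/exim4/update-exim4.conf.conf set:
    dc_local_interfaces='127.0.0.1 ; ::1'

    # Then regenerate Exim's config and restart the daemon:
    sudo update-exim4.conf
    sudo systemctl restart exim4

The same effect can be had via "dpkg-reconfigure exim4-config" by
picking a non-internet-site configuration type and accepting the
default listening addresses.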
An exploit hasn't been published yet but that doesn't mean that one
doesn't exist, and now that the source changes are public it should
be fairly easy for developers to work out how to do it.
Some of the bugs go back to 2004, so basically every Exim install
is at risk. If you are running a release of Debian prior to version
9 (stretch) then it's out of security support and may never see an
updated package for this, so you should strongly consider turning
off any Exim server and doing an OS upgrade before you turn it back
on.
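Turning it off in the meantime is quick. On a systemd-based
release:

    sudo systemctl disable --now exim4

and on older sysvinit-based releases:

    sudo service exim4 stop
    sudo update-rc.d exim4 disable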
If you need help, you could reply to this and seek help from other
customers, or BitFolk can help you as a consultancy service, but
you probably don't want to pay consultancy prices, and for any
moderately complicated setup our approach is going to be an OS
upgrade anyway. Email support@bitfolk.com if you're still
interested in that.
Best of luck with the upgrading!
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
Tonight from around 21:00 BST onwards we started getting alerts and
support tickets regarding customer services on host
jack.bitfolk.com.
I had a look and unfortunately it appears to be a recurrence of
the previous issues regarding stalled I/O:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
https://lists.bitfolk.com/lurker/message/20210220.032844.00dc9600.en.html
https://lists.bitfolk.com/lurker/message/20201116.003514.25278824.en.html
Things were largely unresponsive so I had to forcibly reboot the
server. Customer services were all booted or in the process of
booting by about 21:53.
What we know so far:
- It must be a software issue, not a hardware issue, as it's
happened on multiple servers of different specifications.
- It's only happening with servers that we've upgraded to Xen
version 4.12.x.
- It's going to be really difficult to track down because there can
be months between occurrences.
Each time this has occurred I've made some change that I'd hoped
would lead to a solution, but I've now tried all the easy things and
so all that remains is to do another software upgrade.
I think we're going to have to build some packages for Xen 4.14,
install them on a test host and see how it goes. The difficulty is
that once I do that and it seems to work, we'll never really know
because it could just be in the long period of time where the
problem is not triggered. Clearly once we have seemingly-working
packages we can't leave them spinning for 6 months just to reassure
ourselves of that.
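For reference, the version of Xen a host is actually running can
be checked from its control domain with:

    # xen_version plus xen_extra give the full running version
    xl info | grep -E '^xen_(version|extra)'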
I'm also unsure whether it is a good idea to force additional
downtime on you in order to upgrade servers to 4.14.x when I don't
even know yet if that will fix the issue. What I can do is have the
upgrade ready and then, if/when the issue recurs, do the upgrade at
that point so that the server boots into the new version.
Anyway, all I can say is that this is a really unfortunate state of
affairs that obviously I'm not happy with and I'm doing all that I
can to resolve it. These outages are unacceptable and rest assured
they are aggravating me more than anyone else.
Thanks,
Andy Smith
BitFolk Ltd
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
I'm sorry. There are problems with the host, and VMs on "jack"
will be unresponsive or shutting down. I'm working on it.
Cheers,
Andy
Hi,
At approximately 17:28 BST we started receiving numerous alerts for
server "macallan" and customer services on it. Upon investigation I
was unable to connect to the IPMI console of the server.
I got in contact with the colo provider who quickly realised that
they were doing work in that rack and had knocked out the power
cable for this server.
The server started booting around 17:35 and all customer VMs had
booted by 17:47.
We use locking power cables in our servers to try to minimise this
sort of thing, but they only lock at one end - the server end. The
server's power cord had come loose at the other end.
"macallan" is one of our older servers which is single power supply
unit. To mitigate that risk it plugs into a automatic transfer
switch so that its single PSU continues to receive power even if one
of the rack's two PDUs or power feeds fails. Unfortunately that does
not protect it against its single power cord coming out of the ATS.
We have started a hardware refresh and the new spec servers do have
dual PSUs which should help to avoid things like this in future.
Please accept my apologies for this disruption.
Andy Smith
BitFolk Ltd
--
https://bitfolk.com/ -- No-nonsense VPS hosting