Hi,
At about 20:37Z we started receiving alerts for customer services on
server "hobgoblin". It quickly became apparent that this was the
intermittent "I/O stall" problem we've been seeing on all servers
and have been grappling with for months now.
All I could do was power cycle the server, which I did at about
20:58Z.
We're still unable to reproduce the problem on demand, and it can be
several months between incidents. We've tried upgrading the
hypervisor and that hasn't helped. It's looking more like a problem
in the Linux kernel, so I upgraded that as well to a newer
self-built package.
I've been communicating with a couple of the linux-raid developers
and we have some ideas, but gathering information and making changes
is going slowly because of the lack of reproducibility and the long
time between incidents. It's basically a case of making a single
change each time there is an incident.
With the upgrades done, the server was rebooted once more, and at
about 21:19Z customer VMs started booting again. This was complete
by about 21:33Z.
Obviously I am not happy with these outages and I'm doing everything
I can to find the root cause.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
At around 00:33 BST (23:33Z) we started to receive alerts regarding
services on host "clockwork". Upon investigation the host was
showing all the signs of the intermittent "frozen I/O" problem we've
been having:
https://lists.bitfolk.com/lurker/message/20210425.071102.9d9a1cc5.en.html
As mentioned in that earlier email, I'd decided that the next step
would be to prepare new hypervisor packages, and I did that the next
day.
As this issue seems to happen only every few months and on
different servers, we do not yet know whether the new packages fix
the problem.
They've been in use on other servers since late April without
incident, but that isn't yet proof enough given the long periods
between occurrences.
Anyway, after "clockwork" was power cycled, the new packages were
installed there and then all VMs were started again. This was
completed by about 01:16 BST (00:16Z).
There are still many of our servers where we know this is going to
happen again at some point. I don't feel comfortable scheduling
maintenance to upgrade them when I don't know whether the upgrade
will be effective. If we can go a significant period of time on the
newer version without incident then we will schedule a maintenance
window to get the remaining servers onto it too. It is also possible
that a security patch will force a maintenance window, in which case
we'll upgrade the hypervisor packages to the newer version at the
same time.
There are also some servers, "hen" and "paradox", still left to be
emptied so that their operating systems can be upgraded. Once they
are emptied and upgraded they will of course be put on the newer
version of the hypervisor. Customer services moved from those
servers are expected to be placed on servers that already have the
newer hypervisor version.
Thank you for your patience and I apologise for the disruption. I'm
doing all that I can to try to find a solution.
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting