I am a (delighted!) relatively new BF user and run two dozen websites under CentOS and Virtualmin, with no email as I keep email off my webserver.
I am fed up with cPanel in multiple ways and want to drop the server where I currently have all my email and mail forwarders.
Is another VPS on CentOS with Virtualmin a good route to manage my and my clients’ email?
Or is there a better solution for a mail server?
TL;DR: There were some serious problems with suspend+restore during
the last maintenance and 2 customer VMs were shredded. It looks like
that was due to a kernel bug on the guest side which was fixed in
Linux v4.2 but until we can test that hypothesis we won't be doing
any more suspend+restore. If that holds true then we have to decide
how/if to reintroduce the feature. Unless you have an interest in
this topic you can ignore this email. There's nothing you can/should
do at this stage - though regardless of this you should keep your
operating system up to date of course.
Detailed version follows:
As you're probably aware, we allow customers to opt in to something
we call "suspend and restore". There's more info about that here:
A summary is: if you opt in to it then any time we need to shut the
bare metal host down or move your VM between bare metal hosts, we
save your VM's state to storage and then restore it again
afterwards. The effect seems like a period of paused execution, so
everything remains running and often even TCP sessions will remain
alive. It's a lot less disruptive than a shutdown and boot.
It hasn't always worked perfectly. Sometimes, usually with older
kernels, what is restored doesn't work properly. It locks up or
spews error messages on the console and has to be forcibly
terminated and then booted. These problems have tended to be
deterministic, i.e. either your kernel has problems or it's fine,
and this is repeatable, so when someone has had problems we've
advised them to opt back out of suspend+restore.
Also when there have been problems it hasn't been destructive. I've
never seen the on-boot fsck say more than "unclean shutdown". So
I've always regarded suspend+restore to be pretty safe.
All BitFolk infrastructure VMs use suspend+restore; this is
currently 35 VMs. I've done it several hundred times at this point
and before this week the failure rate was very low and failure mode
not too terrible.
Suspend+restore was used during the maintenance this week on the
first two servers, "clockwork" and "hobgoblin", for customers who
opted in (and BitFolk stuff). There was nothing to indicate problems
with the 10 VMs on "clockwork" that were suspended and restored.
Of the 17 VMs on "hobgoblin" that had opted in, two of them failed
to restore and were unresponsive. Destroy and boot was the only
option left, and then it was discovered that they had both suffered
serious filesystem corruption which was not recoverable.
Losing customer data is the worst. It hasn't happened in something
like 8 years and even then that was down to (my) human error not a
bug in the software we use to provide the service.
After having that happen we did not honour requests to
suspend+restore for the remainder of the maintenance on other
servers and the work proceeded without incident other than that.
We've tracked this bug down to this:
This is fixing a bug in the Linux kernel on the guest side. It's in
the block device driver that Xen guests use. It made its way
upstream in v4.2-rc7 of the Linux kernel.
If I understand it correctly it's saying that across a migration (a
suspend+restore is a migration on the same host) this change is
necessary for a guest to notice if a particular feature/capability
has changed, and act accordingly.
So, without the patch there can be a mismatch between what the
backend driver on the bare metal host and the frontend driver in
your VM understand about the protocol they use, and that is what
caused the corruption for these two VMs.
It is believed that this has never been seen before because we only
recently upgraded our host kernels from 4.19 to 5.10, which is using
some new features. There hasn't been any suspending and restoring
happening with those newer host kernels until this last week. I did
test that, though admittedly not with any EOL guest operating
systems.
The two VMs were running obsolete kernels: A Debian 8 (jessie) 3.16
kernel and a CentOS 6 2.6.32 kernel. These kernels are never going
to receive official backports of that patch because they're out of
support by their distributors.
Since reporting this issue another person has told me that they've
now tested migration with Debian 8 guests and it breaks every time
for them, sometimes with disastrous consequences as we have seen.
I am at this time unable to explain why several other customer VMs
did not experience this calamity, though I am of course glad that
they did not. Out of the 27, many were running kernels as old as
this or even older, but only 2 exploded like this.
So what are we going to do about it?
I think initially we are of course not going to be able to use
suspend+restore any more even if you have opted in to it. We just
won't honour that setting for now.
Meanwhile I think I will test the hypothesis that it's okay with guest
kernels newer than 4.2. Obviously if it's not then the feature
remains disabled. But assuming I cannot replicate this issue with
kernels that have the fix, then we have to decide if and how we're
going to reintroduce this feature.
The rest of this email assumes that guest kernels of 4.2+ are okay,
in which case I am minded to:
1. Reset everyone back to opting out
2. Add a warning on the opt in bit that says it mustn't be used with
kernels older than 4.2 because of this known and serious bug
3. Post something on the announce list saying to opt back in again
if you want (with details of what's been tested etc.)
We can't easily tell what kernels people are running so we don't
have the option of disabling it just for those running older
kernels. There are 85 customer VMs that have opted in to
suspend+restore.
When the time comes we can perhaps do some testing for people who
are interested in re-enabling that but want more reassurance. I can
imagine that this testing would take the form of:
1. Snapshot your storage while your VM is running
2. Suspend+restore your VM.
3. If it works, great. If it explodes then we rollback your
snapshot, which would take a couple of minutes. This would appear
to your operating system to be like a power off or abrupt crash.
Which is something that all software should be robust against,
but occasionally isn't.
I'm undecided on whether it will be worth sending a direct email to
the contacts of those 85 VMs with the background info and offer of
this testing or whether just posting something to the announce list
will be enough.
If you are running a kernel this old then I would of course always
recommend an upgrade or reinstall anyway, regardless of this. You
don't have any security support in your kernel and it's at least 6
years since its release.
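As a rough sketch of how you might check which side of the 4.2 line a
guest kernel falls on, something like the following could be run inside
the VM. The helper name kernel_at_least is made up for this example,
and it assumes a "sort" that supports -V (GNU coreutils does). It only
compares version numbers, so it cannot tell whether a distributor has
backported the fix to an older kernel.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
kernel_at_least() {
    # sort -V orders version strings; if the minimum sorts first
    # (or the two are equal) then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

running=$(uname -r)
# Strip the distribution suffix, e.g. 3.16.0-4-amd64 -> 3.16.0
if kernel_at_least "${running%%-*}" 4.2; then
    echo "kernel $running is 4.2 or newer"
else
    echo "kernel $running is older than 4.2"
fi
```

For the two kernels mentioned above, 3.16 and 2.6.32 both compare as
older than 4.2, while the 4.9 jessie-backports kernel compares as newer.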
On Debian jessie it is fairly easy to just install the
jessie-backports kernel (a 4.9 kernel):
# echo 'Acquire::Check-Valid-Until no;' > /etc/apt/apt.conf.d/99no-check-valid-until
# echo 'deb http://archive.debian.org/debian/ jessie-backports main' > /etc/apt/sources.list.d/backports.list
# apt update
# apt install linux-image-amd64/jessie-backports
or, on a 32-bit (i386) install, instead:
# apt install linux-image-686/jessie-backports
Doing this will (a) disable checking of the validity of all your
package sources, and (b) still leave you on a kernel that has no
security support. But, you already were in that position.
No one needs to take any action at this stage because we just won't
be doing any more suspend+restore until we know more, and probably
the next step after that will be to opt everyone back out and tell
Your comments and thoughts on all of this are welcome.
https://bitfolk.com/ -- No-nonsense VPS hosting
Unfortunately some serious security bugs have been discovered in the
Xen hypervisor and fixes for these have now been pre-disclosed, with
an embargo that ends at 1200Z on 25 August 2021.
As a result we will need to apply these fixes and reboot everything
before that time. We are likely to do this in the early hours of the
morning UK time, on Tuesday 24 and Wednesday 25 August.
In the next few days individual emails will be sent out confirming
to you which hour long maintenance window your services are in. The
times will be in UTC; please note that the UK is currently observing
daylight saving time and as such is currently at UTC+1.
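If you want to double check what a UTC time means in UK local time, a
minimal sketch is below. It assumes GNU date and an installed timezone
database. The function name utc_to_uk and the time shown are made up
for illustration and are not a real maintenance window.

```shell
# Hypothetical helper: print a UTC timestamp in UK local time.
# Relies on GNU date's -d option and the Europe/London zone data.
utc_to_uk() {
    TZ=Europe/London date -d "$1 UTC" '+%Y-%m-%d %H:%M %Z'
}

# Example only: 01:00 UTC on 24 August 2021
utc_to_uk '2021-08-24 01:00'
```

With daylight saving in effect this should report 02:00 BST, i.e. one
hour ahead of the UTC time given.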
We expect the work to take between 15 and 45 minutes per bare metal
host. We are going to take the opportunity to complete upgrading the
kernel and hypervisor on some of the hosts that haven't had that
done yet, which is why the work may take a few minutes more for some
hosts.
There are two hosts left that we are trying to migrate customers off
of ("hen" and "paradox"). That was supposed to be done by now but
that effort has been hampered by the other issues we've been having
and is dragging on. We don't intend to patch or reboot those two
hosts, instead mitigating issues with configuration and renewing
efforts to clear customers off of them. If you are concerned about
that we will be happy to move your service as a priority.
If you have opted in to suspend and restore¹ then your VM will be
suspended to storage and restored again after the host it is on is
rebooted. Otherwise your VM will be cleanly shut down and booted
again.
If you cannot tolerate the downtime then please contact
support(a)bitfolk.com. We will be able to migrate² you to
already-patched hardware before the regular maintenance starts, at a
time of your choosing. You can expect a few tens of seconds of
pausing in that case. This process uses suspend&restore so has the
I've switched the debian_testing install target over to bookworm and
tried it out. I've tested it in amd64 PVH mode only at the moment,
and it worked.
It is however pretty much identical to bullseye at the moment (even
down to the login prompt and /etc/debian_release file, which still
say Debian 11).
You will need to be seeing v1.48bitfolk65 of the Xen Shell to have
it work. Any earlier version will end up installing Debian 11
Throwing this problem out into the wild to see if anyone has any ideas -
already had some help from Andy re: crossgrading so trying not to bother him
with all my woes.
I've upgraded to Buster from Stretch (yep, I know I'm lagging). Courier
seems to have decided to be my SMTP server instead of Exim and I can't get
any connections to either IMAP, IMAP-SSL or SMTP from the rest of the
world. As far as I can see courier is running and should be trying to do
IMAP type things.
Courier doesn't seem to log to its own logfiles, so what it is doing is
being logged to syslog along with a mess of other stuff.
Neither Thunderbird nor Apple Mail gives me anything useful on the attempt to
connect. Just that the connection failed.
Does anyone have any good ideas?
(Note to self - go back in time about 4 years and document your mail-server)
If man has no tea in him, he is incapable of understanding truth and
beauty. ~Japanese Proverb
Find yourself a cup of tea; the teapot is behind you. Now tell me about
hundreds of things. ~Saki