Hi,
TL;DR: We have turned off suspend+restore for everyone. We think it
is okay for you to re-enable it as long as you use kernel 4.2 or
newer (released 6 years ago), but we can't tell what kernel you're
running so we erred on the side of caution. We continue to use it
for our own VMs.
More detail:
We've just opted you all out of suspend+restore because of the
filesystem corruption that afflicted 2 customer VMs during the
maintenance in August. There were 83 customer VMs that previously
had opted in.
While investigating we naturally didn't do any suspend+restore
anyway. I am now satisfied that we know why it happened and under
what circumstances it should be safe to use again, but as a
precaution we have opted everyone out so you can make your own
decisions.
A direct email has gone out to the main contact for each VM that had
previously opted in to this. That email contains far more detail. If
you think you had opted in to suspend+restore but don't see that
email please check your spam folders etc (and then mark it as "not
spam" if necessary!).
You can see the current setting (or opt back in) here:
https://panel.bitfolk.com/account/config/#prefs
You can read more about suspend+restore here:
https://tools.bitfolk.com/wiki/Suspend_and_restore
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
_______________________________________________
announce mailing list
announce@lists.bitfolk.com
https://lists.bitfolk.com/mailman/listinfo/announce
Hi All,
I am trying to ssh back from my Bitfolk VPN into my home system, and it
has stopped working.
Traceroute shows the following:
ian@hobsoni:~$ traceroute -4 109.51.83.178
traceroute to 109.51.83.178 (109.51.83.178), 30 hops max, 60 byte packets
1 macallan.bitfolk.com (85.119.80.25) 0.249 ms 0.636 ms 0.566 ms
2 jump-gw-3.lon.bitfolk.com (85.119.80.3) 3.064 ms 3.533 ms 3.931 ms
3 t2.jump.net.uk (194.153.169.238) 0.370 ms 0.360 ms 0.400 ms
4 as2914.jump.net.uk (194.153.169.185) 0.601 ms 0.483 ms 0.658 ms
5 195.219.23.72 (195.219.23.72) 0.977 ms 0.862 ms 0.758 ms
 6  if-ae-66-2.tcore1.ldn-london.as6453.net (80.231.60.144)  29.946 ms  31.290 ms  31.202 ms
 7  * * *
 8  if-ae-2-2.tcore2.l78-london.as6453.net (80.231.131.1)  30.463 ms  30.381 ms  if-ae-11-2.tcore2.sv8-highbridge.as6453.net (80.231.139.41)  29.692 ms
 9  if-ae-2-2.tcore1.sv8-highbridge.as6453.net (80.231.139.2)  29.665 ms  if-ae-19-2.tcore1.sv8-highbridge.as6453.net (80.231.138.21)  29.863 ms  if-ae-2-2.tcore1.sv8-highbridge.as6453.net (80.231.139.2)  29.932 ms
10  if-ae-1-3.tcore1.pv9-lisbon.as6453.net (80.231.158.29)  33.713 ms  29.070 ms  28.481 ms
11  if-ae-2-2.tcore2.pv9-lisbon.as6453.net (80.231.158.6)  28.858 ms  28.497 ms  28.604 ms
12 195.219.214.18 (195.219.214.18) 28.540 ms 28.010 ms 28.446 ms
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
ian@hobsoni:~$
What does this mean about 195.219.214.18?
And who do I contact to get things put right?
Many thanks
Ian
--
Ian Hobson
Tel (+351) 910 418 473
Hi all,
Hopefully an easy one I can get help with.
I would like to add a new user to the server.
I found the following guide works for me:
https://thucnc.medium.com/how-to-create-a-sudo-user-on-ubuntu-and-allow-ssh…
Is there a gotcha that will cause a problem down the line?
The server is used as a playground for weareveryone.org; no serious
coding happens on it.
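For reference, the guide's steps boil down to something like the
following (a sketch; the user name "newdev" is a placeholder, and
"sudo" is Ubuntu's conventional admin group name):

```shell
# Create the user with a home directory and prompt for a password
# (Debian/Ubuntu's adduser wrapper; run as root).
adduser newdev

# Add the user to the sudo group so they can run commands as root.
usermod -aG sudo newdev

# Optionally allow SSH key login: create their .ssh directory with
# correct ownership and modes, then append their public key to
# /home/newdev/.ssh/authorized_keys.
install -d -m 700 -o newdev -g newdev /home/newdev/.ssh
```

The main gotcha to check afterwards is sshd_config: if the guide had
you set PasswordAuthentication or PermitRootLogin, confirm those
values are what you actually want before logging out.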
Regards,
Andres
--
Andres Muniz-Piniella (he/him/his) CEng mIoP
Everyone’s Warehouse Manager
Participatory City
47 Thames Road
IG11 0HQ
+44 7704 003974
Sent while on the Move
Please tell me how was your day at the Warehouse!
https://eoed.typeform.com/to/i2q8Gf
Hi,
We began receiving alerts at approximately 03:02Z today that host
"macallan" was unresponsive.
There was nothing interesting on its serial console. Its console
also did not respond. Out-of-band access to the BMC worked but
didn't show anything unusual. There were no hardware events logged.
In the face of a hard lock-up, all I could do was power-cycle it.
All customer VMs were booted again by about 03:30Z.
I'll be keeping a close eye on this server. If this repeats then we
may have to move customers off of it at speed and with little
notice.
Apologies for the disruption this has caused.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hello,
TL;DR: There were some serious problems with suspend+restore during
the last maintenance and 2 customer VMs were shredded. It looks like
that was due to a kernel bug on the guest side which was fixed in
Linux v4.2 but until we can test that hypothesis we won't be doing
any more suspend+restore. If that holds true then we have to decide
how/if to reintroduce the feature. Unless you have an interest in
this topic you can ignore this email. There's nothing you can/should
do at this stage - though regardless of this you should keep your
operating system up to date of course.
Detailed version follows:
As you're probably aware, we allow customers to opt in to something
we call "suspend and restore". There's more info about that here:
https://tools.bitfolk.com/wiki/Suspend_and_restore
A summary is: if you opt in to it then any time we need to shut the
bare metal host down or move your VM between bare metal hosts, we
save your VM's state to storage and then restore it again
afterwards. The effect seems like a period of paused execution, so
everything remains running and often even TCP sessions will remain
alive. It's a lot less disruptive than a shutdown and boot.
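On the host side, a suspend+restore with the Xen toolstack looks
roughly like this (a sketch; the domain name and state file path are
illustrative, not BitFolk's actual ones):

```shell
# Save the running guest's state (RAM plus device state) to a file;
# this pauses the guest and tears down the running instance.
xl save myvm /var/lib/xen/save/myvm.state

# ... host maintenance / reboot happens here ...

# Restore the guest from the saved state; execution resumes where it
# left off, so uptime and (often) TCP sessions survive.
xl restore /var/lib/xen/save/myvm.state
```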
It hasn't always worked perfectly. Sometimes, usually with older
kernels, what is restored doesn't work properly. It locks up or
spews error messages on the console and has to be forcibly
terminated and then booted. These problems have tended to be
deterministic, i.e. either your kernel has problems or it's fine,
and this is repeatable, so when someone has had problems we've
advised them to opt back out of suspend+restore.
Also when there have been problems it hasn't been destructive. I've
never seen the on-boot fsck say more than "unclean shutdown". So
I've always regarded suspend+restore to be pretty safe.
All BitFolk infrastructure VMs use suspend+restore; this is
currently 35 VMs. I've done it several hundred times at this point,
and before this week the failure rate was very low and the failure
mode not too terrible.
Suspend+restore was used during the maintenance this week on the
first two servers, "clockwork" and "hobgoblin", for customers who
opted in (and BitFolk stuff). There was nothing to indicate problems
with the 10 VMs on "clockwork" that were suspended and restored.
Of the 17 VMs on "hobgoblin" that had opted in, two of them failed
to restore and were unresponsive. Destroy and boot was the only
option left, and then it was discovered that they had both suffered
serious filesystem corruption which was not recoverable.
Losing customer data is the worst. It hasn't happened in something
like 8 years and even then that was down to (my) human error not a
bug in the software we use to provide the service.
After having that happen we did not honour requests to
suspend+restore for the remainder of the maintenance on other
servers and the work proceeded without incident other than that.
We've tracked this bug down to this:
https://lore.kernel.org/lkml/1437449441-2964-1-git-send-email-bob.liu@oracl…
This is fixing a bug in the Linux kernel on the guest side. It's in
the block device driver that Xen guests use. It made its way
upstream in v4.2-rc7 of the Linux kernel.
If I understand it correctly it's saying that across a migration (a
suspend+restore is a migration on the same host) this change is
necessary for a guest to notice if a particular feature/capability
has changed, and act accordingly.
So, without the patch there can be a mismatch between what the
backend driver on the bare metal host and the frontend driver in
your VM understand about the protocol they use, and that is what
caused the corruption for these two VMs.
It is believed that this has never been seen before because we only
recently upgraded our host kernels from 4.19 to 5.10, which uses
some new features. No suspending and restoring happened with those
newer host kernels until this last week. I did test it beforehand,
but admittedly not with any EOL guest operating systems.
The two VMs were running obsolete kernels: a Debian 8 (jessie) 3.16
kernel and a CentOS 6 2.6.32 kernel. These kernels are never going
to receive official backports of that patch because they're out of
support by their distributors.
Since reporting this issue, another person has told me that they've
now tested migration with Debian 8 guests and it breaks every time
for them, sometimes with disastrous consequences like we have seen.
I am at this time unable to explain why several other customer VMs
did not experience this calamity, though I am of course glad that
they did not. Out of the 27, many were running kernels as old as
this or even older, but only 2 exploded like this.
So what are we going to do about it?
I think initially we are of course not going to be able to use
suspend+restore any more even if you have opted in to it. We just
won't honour that setting for now.
Meanwhile I think I will test the hypothesis that it's okay with guest
kernels newer than 4.2. Obviously if it's not then the feature
remains disabled. But assuming I cannot replicate this issue with
kernels that have the fix, then we have to decide if and how we're
going to reintroduce this feature.
The rest of this email assumes that guest kernels of 4.2+ are okay,
in which case I am minded to:
1. Reset everyone back to opting out
2. Add a warning to the opt-in setting saying it mustn't be used
   with kernels older than 4.2 because of this known and serious bug
3. Post something on the announce list saying to opt back in again
if you want (with details of what's been tested etc.)
We can't easily tell what kernels people are running so we don't
have the option of disabling it just for those running older
kernels. There are 85 customer VMs that have opted in to
suspend+restore.
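If you're unsure which side of that line you're on, a quick check
from inside your VM might look like this (a sketch; "kernel_ok" is
just an illustrative helper name):

```shell
#!/bin/sh
# Compare the running kernel version against 4.2, the first release
# containing the xen-blkfront fix. Succeeds if it's new enough.
kernel_ok() {
    # Strip any "-flavour" suffix (e.g. "-8-amd64") before comparing.
    base="${1%%-*}"
    # sort -V puts the lower version first; if 4.2 sorts first (or is
    # equal), the running kernel is at least 4.2.
    [ "$(printf '%s\n%s\n' 4.2 "$base" | sort -V | head -n1)" = "4.2" ]
}

if kernel_ok "$(uname -r)"; then
    echo "Kernel $(uname -r) has the fix; suspend+restore should be safe."
else
    echo "Kernel $(uname -r) predates the fix; stay opted out."
fi
```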
When the time comes we can perhaps do some testing for people who
are interested in re-enabling that but want more reassurance. I can
imagine that this testing would take the form of:
1. Snapshot your storage while your VM is running
2. Suspend+restore your VM.
3. If it works, great. If it explodes then we roll back your
   snapshot, which would take a couple of minutes. To your operating
   system this would look like a power-off or abrupt crash, which is
   something that all software should be robust against, but
   occasionally isn't.
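On the host side, that snapshot-then-test flow could be sketched
with LVM (illustrative volume group and names; these are not the
real BitFolk device paths):

```shell
# 1. Take a copy-on-write snapshot of the VM's logical volume while
#    the VM is still running.
lvcreate --snapshot --name myvm-snap --size 10G /dev/vg0/myvm

# 2. Suspend and restore the VM as usual.

# 3a. If all is well, discard the snapshot.
lvremove /dev/vg0/myvm-snap

# 3b. If the restore exploded: destroy the VM, merge the snapshot
#     back so the volume reverts to its pre-test contents, then boot.
lvconvert --merge /dev/vg0/myvm-snap
```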
I'm undecided on whether it will be worth sending a direct email to
the contacts of those 85 VMs with the background info and offer of
this testing or whether just posting something to the announce list
will be enough.
If you are running a kernel this old then I would of course always
recommend an upgrade or reinstall anyway, regardless of this. You
don't have any security support in your kernel and it's at least 6
years since its release.
On Debian jessie it is fairly easy to just install the
jessie-backports kernel (a 4.9 kernel):
https://www.lucas-nussbaum.net/blog/?p=947
Summary:
# echo 'Acquire::Check-Valid-Until no;' > /etc/apt/apt.conf.d/99no-check-valid-until
# echo 'deb http://archive.debian.org/debian/ jessie-backports main' > /etc/apt/sources.list.d/backports.list
# apt update
# apt install linux-image-amd64/jessie-backports
or:
# apt install linux-image-686/jessie-backports
Doing this will (a) disable checking of the validity of all your
package sources, and (b) still leave you on a kernel that has no
security support. But, you already were in that position.
No one needs to take any action at this stage because we just won't
be doing any more suspend+restore until we know more, and probably
the next step after that will be to opt everyone back out and tell
you that.
Your comments and thoughts on all of this are welcome.
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hello,
Unfortunately some serious security bugs have been discovered in the
Xen hypervisor and fixes for these have now been pre-disclosed, with
an embargo that ends at 1200Z on 25 August 2021.
As a result we will need to apply these fixes and reboot everything
before that time. We are likely to do this in the early hours of the
morning UK time, on Tuesday 24 and Wednesday 25 August.
In the next few days individual emails will be sent out confirming
to you which hour long maintenance window your services are in. The
times will be in UTC; please note that the UK is currently observing
daylight saving time and as such is at UTC+1.
We expect the work to take between 15 and 45 minutes per bare metal
host. We are going to take the opportunity to complete upgrading the
kernel and hypervisor on some of the hosts that haven't had that
done yet, which is why the work may take a few minutes more for some
hosts.
There are two hosts left that we are trying to migrate customers off
of ("hen" and "paradox"). That was supposed to be done by now but
that effort has been hampered by the other issues we've been having
and is dragging on. We don't intend to patch or reboot those two
hosts, instead mitigating issues with configuration and renewing
efforts to clear customers off of them. If you are concerned about
that we will be happy to move your service as a priority.
If you have opted in to suspend and restore¹ then your VM will be
suspended to storage and restored again after the host it is on is
rebooted. Otherwise your VM will be cleanly shut down and booted
again later.
If you cannot tolerate the downtime then please contact
support(a)bitfolk.com. We will be able to migrate² you to
already-patched hardware before the regular maintenance starts, at a
time of your choosing. You can expect a few tens of seconds of
pausing in that case. This process uses suspend+restore so has the
same caveats.
Thanks,
Andy
¹ https://tools.bitfolk.com/wiki/Suspend_and_restore
² https://tools.bitfolk.com/wiki/Suspend_and_restore#Migration
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hi,
I've switched the debian_testing install target over to bookworm and
tried it out. I've tested it in amd64 PVH mode only at the moment,
and it worked.
It is, however, pretty much identical to bullseye at the moment
(even down to the login prompt and the /etc/debian_version file,
which still say Debian 11).
You will need v1.48bitfolk65 of the Xen Shell for this to work. Any
earlier version will end up installing Debian 11 (bullseye).
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Hello All,
Throwing this problem out into the wild to see if anyone has any
ideas - I've already had some help from Andy re. crossgrading, so
I'm trying not to bother him with all my woes.
I've upgraded to Buster from Stretch (yep, I know I'm lagging). Courier
seems to have decided to be my SMTP server instead of Exim and I can't get
any connections to IMAP, IMAP-SSL or SMTP from the rest of the
world. As far as I can see Courier is running and should be trying
to do IMAP-type things.
Courier doesn't seem to log to its own logfiles, so what it is doing is
being logged to syslog along with a mess of other stuff.
Neither Thunderbird nor Apple Mail gives me anything useful on the
attempt to connect - just that the connection failed.
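A starting point might be to see which daemons actually hold the
mail ports and whether anything answers locally (a sketch; run on
the server, and some of these may need root):

```shell
# Which processes are listening on the SMTP and IMAP ports?
ss -ltnp '( sport = :25 or sport = :143 or sport = :993 )'

# Talk to the SMTP port directly; the 220 banner line reveals
# whether Exim or Courier answered.
printf 'QUIT\r\n' | nc -w 5 localhost 25

# Watch mail-related syslog traffic while a client tries to connect.
tail -f /var/log/syslog | grep -Ei 'courier|exim|imap'
```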
Does anyone have any good ideas?
(Note to self - go back in time about 4 years and document your mail-server)
cheers,
David
If man has no tea in him, he is incapable of understanding truth and
beauty. ~Japanese Proverb
Find yourself a cup of tea; the teapot is behind you. Now tell me about
hundreds of things. ~Saki
Hi,
The new stable release of Debian, bullseye, was released over the
weekend:
https://bits.debian.org/2021/08/bullseye-released.html
This is now supported for self-install:
https://tools.bitfolk.com/wiki/Using_the_self-serve_net_installer#Debian
by doing "install debian_bullseye", and also of course as a new
order.
If upgrading in-place from buster to bullseye please make sure to
read the release notes as there are a few things to be aware of:
https://www.debian.org/releases/stable/amd64/release-notes/ch-upgrading.en.…
There aren't any known BitFolk-specific issues with bullseye, though
we do suggest that if you're running buster or beyond that you do so
in PVH mode:
https://tools.bitfolk.com/wiki/PVH
I *think* it is still possible to install the new testing release
(bookworm) by doing "install debian_testing" but we have to check
that and fix it if necessary, which will happen later this week.
If you're desperate to do a clean install of Debian testing and you
find that "install debian_testing" doesn't work, then I recommend
installing bullseye and doing an in-place upgrade from there (the
two will be almost identical right now).
Cheers,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
Are there any known routing issues at the moment? I'm unable to ping
my VM on macallan over IPv4 but can over IPv6. There also appears to
be a lot of packet loss on an ICMP traceroute.
I also have an MRTG script running every 5 minutes on my VM that pings
an AAISP-provided IP address, and that's failing even though the host is up.
Cheers,
Mike