Hi,
Between about 20:25Z and ~20:50Z today host "Jack" lost all
networking. All of the VMs on it became unreachable.
It seems to have been some sort of kernel driver bug in the
Ethernet module: the interface was "stuck", not passing traffic,
but still showed as up.
The hosts have bonded network interfaces to protect against switch
failure, but as the interface stayed up, the link was not considered
failed. Also, the bonds are in active-backup mode and the
currently-active interface was the one that was stuck, so all
traffic was trying to go that way.
Networking was restored by setting the link down and up again.
Traffic started to flow again, BGP sessions re-established and all
was fine again.
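For reference, bouncing the stuck member can be done with iproute2 (run
as root; the interface name here is a made-up example):

```
# Hypothetical member interface name. Forcing the link down makes the
# bond treat that member as failed; bringing it back up re-negotiates
# the link and lets traffic flow again.
ip link set dev enp1s0f0 down
ip link set dev enp1s0f0 up
```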
We could look into some sort of link keepalive method on the bonded
interfaces as opposed to just relying on link state, but we have
already decided to move away from bonded networking in favour of
separate BGP sessions on each interface. That is how the next new
servers will be deployed; they will not have network bonding. We
have not yet tackled moving existing servers to this setup.
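If we did go down the keepalive route, the bonding driver's ARP monitor
is the usual mechanism: the bond periodically ARPs a real next hop
rather than trusting carrier state alone, so a link that is "up" but
not passing traffic gets failed over. A sketch in Debian ifupdown
style, with made-up interface names and addresses:

```
# /etc/network/interfaces sketch (ifenslave); names and addresses are
# hypothetical. Note the ARP monitor replaces miimon (link-state)
# monitoring; the two are not used together.
auto bond0
iface bond0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    bond-slaves enp1s0f0 enp1s0f1
    bond-mode active-backup
    bond-arp-interval 1000
    bond-arp-ip-target 192.0.2.1
```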
If we had been in this situation without bonding, I think we would
have fared better: there would have been a short blip while one
BGP session went down, but the other would have remained, and we'd
be left with some alerting and me scratching my head wondering why an
interface that is up doesn't pass traffic.
I will do some more investigation of this failure mode but in light
of doing away with bonding being the direction we are already going,
I don't think I want to alter how bonding is done on what will soon
be a legacy setup.
Thanks,
Andy
--
https://bitfolk.com/ -- No-nonsense VPS hosting
>
> I will do some more investigation of this failure mode but in light of
> doing away with bonding being the direction we are already going, I don't
> think I want to alter how bonding is done on what will soon be a legacy
> setup.
Shouldn't this failure mode have been caught by LACPDUs?
--
Maria Blackmore
I am responsible for several VPSes, here and elsewhere. Five of them are
running Ubuntu 22.04 and three are running Debian.
The script I use to update them checks, at the end, for the existence of
/var/run/reboot-required
If it finds it, it offers to reboot the VPS.
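The check described above amounts to something like the following
sketch (FLAG is a hypothetical variable so the path can be overridden;
on a real system it is /var/run/reboot-required, created by Ubuntu's
update-notifier hooks when a package such as a new kernel requests a
reboot):

```shell
#!/bin/sh
# Hypothetical end-of-update check. FLAG defaults to the real path but
# can be pointed elsewhere for testing.
FLAG="${FLAG:-/var/run/reboot-required}"

if [ -f "$FLAG" ]; then
    echo "Reboot required on $(hostname):"
    cat "$FLAG"
    # A real script would prompt here before running "reboot".
else
    echo "No reboot required."
fi
```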
It does this happily on all but one VPS, one of the Ubuntu ones here. The
Ubuntu version of apt-get on all of the Ubuntu ones recognises that a
reboot is required after a kernel update etc. and will pop up a message
saying so, but on this single machine the file doesn't seem to exist
afterwards.
I have no idea why not. Anyone got any ideas?
Ian