Between approximately 23:07Z and 23:14Z today, due to an error in
some work our colo provider was undertaking, two servers suffered
a total network outage and our remaining servers a partial outage.
Apologies for the disruption. We do not expect it to recur.
The two worst-affected servers were "clockwork" and "macallan".
Each of our servers has a pair of public network interfaces, each
connected to a separate switch. The error took out one of the
switches but left our ports on it enabled, so that switch became a
black hole for any traffic sent through it.
At the moment we use network bonding in active-backup mode. The
bonds didn't fail over because the link state on the affected ports
never went down, so the two servers that had the faulty switch as
their "active" interface experienced the longer outage.
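
To illustrate why the bond stayed put, here is a minimal Python sketch
of active-backup selection that looks only at link state. It is a toy
model, not our actual configuration and not the bonding driver's code;
the Port class, its fields and both function names are made up for
illustration.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    carrier_up: bool        # link state as seen by the server's interface
    forwards_traffic: bool  # whether the switch behind it passes packets


def pick_active_by_link_state(active: Port, backup: Port) -> Port:
    """Active-backup choice driven only by link state."""
    # Fail over only when the active port loses link and the backup has it.
    if not active.carrier_up and backup.carrier_up:
        return backup
    return active


def pick_active_by_reachability(active: Port, backup: Port) -> Port:
    """Hypothetical alternative: also require that traffic gets through."""
    active_ok = active.carrier_up and active.forwards_traffic
    backup_ok = backup.carrier_up and backup.forwards_traffic
    if not active_ok and backup_ok:
        return backup
    return active


# The incident as described: the faulty switch left the port enabled
# (link up) but passed no traffic.
faulty = Port("port-on-faulty-switch", carrier_up=True, forwards_traffic=False)
healthy = Port("port-on-healthy-switch", carrier_up=True, forwards_traffic=True)

chosen = pick_active_by_link_state(faulty, healthy)
print(chosen.name, "forwards traffic:", chosen.forwards_traffic)
```

With link state alone, the faulty port stays active even though nothing
gets through it. The second function only shows what a check based on
actual reachability would have decided in the same situation.
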
We are in the middle of moving away from the bonded setup. Instead,
each interface will do BGP with our colo provider separately, and BGP
will provide the redundancy. We are already doing the BGP part but
have yet to disable the bonding and split the interfaces back out.
When that is complete (which I would hope to have done in a timescale
of weeks, not months) this failure mode won't exist.
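
As a rough sketch of why BGP avoids this failure mode: a BGP session is
kept alive by messages exchanged over the path itself and is torn down
if nothing is heard within the hold time, so a port that is up but
passes no traffic stops being used. The Python below is a toy model
under that assumption; the function name, path labels and timer values
are hypothetical and are not our real settings.

```python
def paths_still_up(last_heard, now, hold_time):
    """Return the paths whose BGP session has not hit its hold timer.

    last_heard maps a path label to the time (in seconds) at which a
    keepalive or update was last received over that path.
    """
    return sorted(
        path for path, heard_at in last_heard.items()
        if now - heard_at < hold_time
    )


# Hypothetical numbers: switch A went quiet at t=0, switch B is still
# delivering keepalives, and the hold time is 9 seconds.
last_heard = {"via-switch-a": 0.0, "via-switch-b": 28.0}
remaining = paths_still_up(last_heard, now=30.0, hold_time=9.0)
print(remaining)
```

In this model the path through the quiet switch drops out once the hold
time has passed, regardless of whether its link state ever changed.
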
Apologies again,
https://bitfolk.com/ -- No-nonsense VPS hosting