Hello,
On Wed, Nov 26, 2014 at 03:27:10PM +0000, Andy Smith wrote:
- How kwak came to be power cycled
Someone from our colo provider was working in the rack (on other
hardware) at the time of the power interruption and most likely
knocked into the power distribution unit (PDU) end of the cable
causing a momentary power loss. They were unaware that they had done
this, but the power loss/restore happened at the exact time they
were working in the rack.
I've had a chat with them about this, and we've come to the
conclusion that neither of us is happy with the PDUs in that rack
as the PDU end sockets are all rather loose and it's too easy for
this to occur.
While kwak.bitfolk.com was out of service we went through the
cabling and made sure it was secure.
They have already ordered new PDUs which support locking cables,
i.e., you plug them in and then they don't come out no matter how
hard you knock them because they have a catch on them. These should
be available in January and new hardware will use these; I'll then
have to make a decision about whether to have that machine re-cabled
(will involve scheduled maintenance if so).
- Why it didn't boot with networking enabled
When kwak booted after its power cycle it seemed to be in a state
where neither of its network interfaces was up. This was highly
confusing and in fact I initially thought that perhaps both network
cables had been unplugged. This was actually due to a configuration
error on my part.
kwak.bitfolk.com is one of the oldest servers we have. It had
actually been up for several years when it was power cycled.
All of our servers have bonded networking. That is, there are two
network interfaces each of which is cabled to a separate switch in
the rack. Should the network port, cable or switch die then the
machine should be able to continue using the other path.
In order to do this we use the standard Linux "bonding" driver that
takes over the two network interfaces eth0 and eth1 and creates a
new one, bond0, that is used instead. That's the one that is
mentioned in the host's networking configuration.
kwak was not configured correctly and the "bonding" kernel module
was not loaded on boot. So, bond0 did not exist and networking was
not brought up. eth0 and eth1 had a link but hadn't been set as "up"
because they are not mentioned in configuration.
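For anyone wanting to check their own hosts for the same mistake,
here is a minimal sketch in Python. It assumes a Linux host where
bonding is built as a loadable module (so it appears in
/proc/modules; a kernel with bonding built in would not list it) and
that the bond is named bond0 as above. The function names are made
up for illustration, and this is not necessarily how kwak was
checked.

```python
import os


def module_loaded(name, proc_modules_text):
    """Return True if `name` is the first field on any line of the
    given /proc/modules contents."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def interface_exists(iface, sys_net="/sys/class/net"):
    """Return True if the kernel currently has an interface called
    `iface`, judged by its directory under /sys/class/net."""
    return os.path.isdir(os.path.join(sys_net, iface))
```

On a live host you would pass in the contents of /proc/modules, e.g.
module_loaded("bonding", open("/proc/modules").read()), and call
interface_exists("bond0"). Both being false straight after a boot is
the situation described above.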
When kwak was originally installed in the rack it didn't actually
have resilient networking (i.e. it didn't use bonding, it just used
eth0). It was switched to bonding on the fly later on without a
reboot, and while it obviously had a working active configuration,
the boot-time configuration had never actually been tested, and that
part had not been done correctly.
The main problem this actually caused was one of confusion on my
part, and time was wasted wondering if there was a problem with the
cabling and/or switches, to the point of actually asking someone to
trace the cables. It was only when they said, "I can see link
lights" that I even thought to check that the interfaces had carrier
(were plugged in and seeing the switch on the other end) but were
just not configured.
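The "cabled but not configured" case can be spotted from the host
itself. The following is a rough Python sketch, assuming a Linux
host with sysfs mounted at /sys; the helper names are hypothetical.
It only reports whether an interface has been set administratively
"up" (the IFF_UP bit in /sys/class/net/<iface>/flags). As far as I
know the sysfs carrier file cannot be read while an interface is
down, so link presence still needs something like ethtool or a look
at the link lights.

```python
import os

IFF_UP = 0x1  # administrative "up" flag, as defined in <linux/if.h>


def is_admin_up(flags_text):
    """Parse the hex string from /sys/class/net/<iface>/flags and
    report whether the IFF_UP bit is set."""
    return bool(int(flags_text.strip(), 16) & IFF_UP)


def down_interfaces(ifaces, sys_net="/sys/class/net"):
    """Return those of `ifaces` that exist but are not
    administratively up. Interfaces that do not exist are skipped."""
    down = []
    for iface in ifaces:
        path = os.path.join(sys_net, iface, "flags")
        try:
            with open(path) as f:
                if not is_admin_up(f.read()):
                    down.append(iface)
        except FileNotFoundError:
            pass  # no such interface on this host
    return down
```

Calling down_interfaces(["eth0", "eth1"]) on a host in kwak's state
would be expected to return both names, which points at
configuration rather than cabling.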
So, obviously I've corrected that configuration and also the
configuration on the one other server that was in the same
situation. More to the point I feel this highlights a need for an
improvement in process so that information about individual servers,
e.g. "this server has never had its bonded networking tested from a
clean boot" is retained somewhere other than in my own head, where
it is prone to bit rot over a period of years.
- Why IPv6 broke for everyone even though it should have failed over
  to another router
This bit I am still looking into.
Cheers,
Andy
--
http://bitfolk.com/ -- No-nonsense VPS hosting
"I am the permanent milk monitor of all hobbies!" — Simon Quinlank