Hi,
On Wed, Jul 22, 2009 at 11:45:35PM +0100, Philip Veale wrote:
> > I will be shutting down every server in turn and hopefully keeping
> > them off for no more than about 30 minutes each. I'm not intending
> > to do suspend/restore, so therefore you can expect a clean shut down
> > and then boot up again around 30 minutes later.
> > Don't forget, this is happening tonight from 1800Z (7pm UK time).
> Is this still ongoing; were there complications?
Yes, I'm afraid there was a problem with corona, the last server I
worked on, and the circumstances particularly affected you.
Firstly it took me slightly longer to work on it than I'd hoped, as
getting the old power supply out was a bit trickier than I thought it
would be.
Secondly, after I'd re-racked it, the network didn't work anymore.
I suspected either I'd plugged the wrong Ethernet cable into it or
else I'd fried something. I was a bit pushed for time by this point
though and the servers have two Ethernet controllers, so I just
configured it to use the second one instead, intending to follow the
matter up later.
The next problem was that the ARP tables of the upstream routers had
the MAC address of corona's eth0 as the next hop for everyone's IP
addresses. I used arping to send out ARP adverts for eth1 for all
the IPs, checked a few things to see if it had worked, and hastily
packed up.
At this point I had overlooked something. I had only ARPed for IPs
inside BitFolk's IP range. You are unique in that you came with
your own IP range, and I hadn't advertised that again, so all your
traffic was hitting my dead eth0. Your VPS remained without network
whilst I made my way home.
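To illustrate the gap, here is a minimal sketch of how the list of
addresses to re-advertise could be built from every routed range rather
than just the main one. It only prints the commands; it does not run
them. The ranges shown are documentation addresses standing in for the
real ones, the function name is made up for this example, and the
arping flags assume the iputils version (-U for unsolicited ARP, -I for
the interface, -c for the count). Other arping implementations and
routing setups may need different options.

```python
import ipaddress


def arping_commands(networks, interface="eth1", count=3):
    """Build one arping command per usable address in each network.

    Hypothetical helper: flags assume iputils arping
    (-U unsolicited ARP, -I interface, -c packet count).
    """
    commands = []
    for cidr in networks:
        net = ipaddress.ip_network(cidr)
        # hosts() skips the network and broadcast addresses.
        for ip in net.hosts():
            commands.append(
                ["arping", "-U", "-I", interface, "-c", str(count), str(ip)]
            )
    return commands


# Placeholder ranges (RFC 5737 documentation space), not the real ones.
HOSTING_RANGE = "192.0.2.0/29"
CUSTOMER_RANGE = "198.51.100.8/30"  # a customer-supplied range, easy to forget

if __name__ == "__main__":
    for cmd in arping_commands([HOSTING_RANGE, CUSTOMER_RANGE]):
        print(" ".join(cmd))
```

Keeping customer-supplied ranges in the same list as the main range
means a single loop covers both, which is the step that was missed here.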
Just as I got home and went online to check everything, I saw corona
power cycle itself. That was pretty worrying (I'd just replaced the
PSU...), until I got a text message from the colo provider saying
that he'd noticed I'd put the wrong Ethernet cable into corona and
he had fixed this for me, but while doing so he'd tripped the power
as the cable had been very loose - that was my fault.
So, everyone on corona got an unexpected power cycle at around
0010Z. Please accept my apologies for this disruption.
As far as I can see at present, everyone on corona is up and
running. I have also just fixed the networking for your VPS,
Philip; double apologies for the length of your downtime.
Other than the above, the work tonight went reasonably well. I
didn't get quite as much power saving as I'd hoped, because a lot of
the servers were already configured properly, but I did get a
worthwhile amount and I have learnt more about what works, which I
can apply to future servers before they are racked.
Some other lessons learnt:
- It probably would have been better to budget an hour per server
for this. Most were much faster but it only takes a few awkward
issues to blow the schedule.
- Stupid mistakes were made when rushing to finish work on corona;
splitting the work over two nights would have been worth it, and
would have been necessary if I'd had to work on the other three
servers too.
- LDAP broke when the master server was down, leaving no one able
to connect to their consoles whilst faustino was down. There
are two replicas so this shouldn't have happened, but obviously
they didn't work properly and this needs to be addressed.
- Flipping the BitFolk web site over to another location with a
holding page explaining what was going on would have been useful
for the people who aren't on this mailing list.
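On the LDAP point, a rough sketch of the kind of pre-maintenance check
that might have caught it: walk the list of servers in order and report
the first one that accepts a TCP connection. The host names are
placeholders and the function name is made up for this example. A
successful connection on port 389 only shows that something is
listening; it does not prove the replica answers queries correctly, so
treat this as a first-pass check only.

```python
import socket


def first_reachable(servers, port=389, timeout=2.0,
                    connect=socket.create_connection):
    """Return the first host that accepts a TCP connection, else None.

    Hypothetical helper: 'connect' is injectable so the logic can be
    exercised without touching the network.
    """
    for host in servers:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Host is down or refusing connections; try the next one.
            continue
        conn.close()
        return host
    return None


if __name__ == "__main__":
    # Placeholder names, not the real servers.
    candidates = ["ldap-master.example", "ldap-replica1.example",
                  "ldap-replica2.example"]
    print(first_reachable(candidates))
```

Running this with the master deliberately left out of the list before
a maintenance window would show whether a replica is at least
reachable.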
There will be some kind of service credit issued to those on corona
who experienced an unexpectedly long outage. I will contact you
individually about this in the next few days.
Apologies again for the inconvenience caused.
Cheers,
Andy
--
http://bitfolk.com/ -- No-nonsense VPS hosting
"Xandros's low-level support for the Eee mostly seemed to consist of a pile of
shell scripts made of cheese and failure." -- Matthew Garrett