Hi folks,
On Thu, Oct 01, 2009 at 06:53:10PM +0000, Andy Smith wrote:
> I'm in the process of arranging access to the colo in order to
> move all customers on faustino to spare hardware (which is actually
> higher spec than faustino since it was purchased more recently).
> I plan to do this work from 0100Z (2am UK time) on the morning of
> Monday 5th October.
The work is finished now and seems to have gone well. faustino was
shut down at 0100Z and everything was back up by 0255Z.
I spent a few minutes checking over faustino's event logs and such
to see if I could find any obvious traces of hardware fault, but
unfortunately there were none.
I then removed the disks from a spare server and from faustino.
This took ages because the supplier of faustino saw fit to screw the
disks into the disk caddies to pointless levels of tightness, even
damaging the screw heads as they did so. With a couple of them I
thought I was going to completely destroy the screw heads in trying
to unscrew them, and if I had then I'd have been going home without
completing the work.
Note to self: make sure you can unscrew all the disks out of the
caddies when the server is delivered.
I tested the spare server's disks in faustino to ensure the RAID
card would accept a disk swap (no issues) and then moved faustino's
disks to new hardware.
Apart from the disks, faustino now has entirely new hardware, in a
different rack in a different room. If the same problems manifest
then it can't be hardware, and we'll have to schedule a rebuild of
the OS.
I hope that the problems with this server are now behind us, and if
so then I will be adding 12 days of free service (23rd -> 5th) for
all of the customers on faustino. I'm going to wait a week before
doing this though, just in case celebrations are premature. Please
continue paying your bills as normal! ;)
faustino used to be a dual core Opteron 2212HE 2GHz with 12GiB RAM.
It is now a quad core Xeon 5410 2.33GHz with 16GiB RAM.
Thanks for your patience in this matter. The 2 power cycles and a
~2 hour outage you've suffered so far are not great, though hardware
problems and software bugs are unavoidable. Realistically when
these sorts of things happen I don't think there's much scope for a
faster response or resolution than this, at this end of the hosting
market.
Fortunately BitFolk is now big enough to have spare hardware about,
making this a lot less painful than it might have been 2 years ago.
I can improve this by adding more spare servers and by being more
willing to whip the disks out and make use of the spares when there
is any hint of hardware issues.
Shared storage would of course help but this is a scary topic!
Happy to continue that discussion off-list.
Cheers,
Andy
--
http://bitfolk.com/ -- No-nonsense VPS hosting
"Xandros's low-level support for the Eee mostly seemed to consist of a pile of
shell scripts made of cheese and failure." -- Matthew Garrett