hardware problems on barbar, 1826Z and ongoing

30 May 2012

On Wed, May 30, 2012 at 07:02:17PM +0000, Andy Smith wrote:
...
  Hi,

 We're experiencing widespread network problems, including a DDoS and
 also a possible hardware failure on barbar, all at the same time.

 I am investigating and will let you know more as soon as I can.

Here's the situation:

At just before 1826Z I enabled network device bonding on one of our
older servers, curacao. curacao is in the process of being cleared
of customers and only had a couple of people left on it. That's why
I chose to do it first. It was not expected to cause any issues.

It was done and all seemed fine.

A couple of minutes later, alerts started firing across all servers
about things being unreachable. At the same time, customers on
barbar experienced disk IO problems which caused running processes
to crash.

A few things were established fairly quickly:

- A DDoS was in progress against a customer on barbar.
- The resolvers were unreachable.
- barbar was suffering a possible hardware failure.

The customer under DDoS was blackholed and the DDoS stopped a short
time later.

The resolver failure was tracked down to some sort of split brain
scenario in the cluster, and was patched up by forcing certain nodes
to run.

For the last few hours I have been working on barbar.

barbar is one of our older servers and represents an experiment in
SAS disks instead of hardware RAID+SATA. It's therefore the only
server we have that uses Linux software RAID (mdadm).

mdadm has for some reason failed 3 of the 4 disks in one of the
RAID10 arrays, which is fatal for a RAID10. I do not believe that
there is an actual failure of any of the drives because only one of
the arrays was kicked, not all of them, and I have been able to
reboot it and access data on all the arrays that do work.
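To illustrate why three failed members out of four is fatal, here is a minimal sketch. It assumes the four disks form two mirrored pairs (the common "near 2" RAID10 layout); barbar's actual layout is not stated here, and the names MIRROR_PAIRS, array_survives and surviving_fraction are made up for illustration.

```python
from itertools import combinations

# Assumption: 4 disks arranged as two mirrored pairs ("near 2" layout).
# The real layout on barbar is not stated in this post.
MIRROR_PAIRS = [(0, 1), (2, 3)]


def array_survives(failed):
    """The array is usable only if every mirror pair keeps at least one disk."""
    failed = set(failed)
    return all(any(d not in failed for d in pair) for pair in MIRROR_PAIRS)


def surviving_fraction(n_failed):
    """Return (survivable combinations, total combinations) for n_failed disks."""
    combos = list(combinations(range(4), n_failed))
    ok = sum(array_survives(c) for c in combos)
    return ok, len(combos)


if __name__ == "__main__":
    for n in range(1, 4):
        ok, total = surviving_fraction(n)
        print(f"{n} failed: {ok} of {total} combinations survivable")
```

Under that assumption any single failure is survivable, some double failures are, and no triple failure is, because at least one mirror pair always loses both of its disks.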

Unfortunately the array that does not work holds all customer data
for that node as well as /usr and /var for barbar itself.

Customers on barbar should come to terms with the possibility of
complete data loss at this point.

Having said that, I am trying to avoid this. As I say, I believe the
data is largely there. It may be slightly corrupted due to one disk
running for a while without the others. For some people the
corruption may be minor; for others it may be huge.

Having forcibly re-assembled the array we are now at the stage where
I have access to all of the devices, but they may contain damaged
data. I am taking a backup of the /usr and /var data as it is now,
before I fsck them (fsck -n already says we are in for a bumpy
but probably not catastrophic ride). If I manage to get the actual
server up then I will report on what we will do next.

I do not yet know the root cause of why barbar threw three of its
disks out, but I suspect it may be related to the DDoS. I suppose it
is possible that extreme load made it decide that three of its disks
were not responding and that it was a good idea to kick them, though
I've never seen that before. I won't speculate on that further until
I have more information (having access to /var should help!).

Thanks again for your patience.

Cheers,
Andy

-- 
http://bitfolk.com/ -- No-nonsense VPS hosting
