<
mailto:users-request@lists.bitfolk.com?subject=unsubscribe>
List-Archive: <http://lists.bitfolk.com/lurker/list/users.html>
List-Post: <mailto:users@lists.bitfolk.com>
List-Help: <mailto:users-request@lists.bitfolk.com?subject=help>
List-Subscribe: <https://lists.bitfolk.com/mailman/listinfo/users>,
<mailto:users-request@lists.bitfolk.com?subject=subscribe>
X-List-Received-Date: Wed, 30 May 2012 22:27:55 -0000
--===============1596240413==
Content-Type: multipart/signed; micalg=pgp-ripemd160;
protocol="application/pgp-signature"; boundary="H1spWtNR+x+ondvy"
Content-Disposition: inline
--H1spWtNR+x+ondvy
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
On Wed, May 30, 2012 at 07:02:17PM +0000, Andy Smith wrote:
> Hi,
>
> We're experiencing widespread network problems, including a DDoS and
> also a possible hardware failure on barbar, all at the same time.
>
> I am investigating and will let you know more as soon as I can.
Here's the situation:

At just before 1826Z I enabled network device bonding on one of our
older servers, curacao. curacao is in the process of being cleared
of customers and only had a couple of people left on it. That's why
I chose to do it first. It was not expected to cause any issues.

It was done and all seemed fine.

A couple of minutes later, alerts started firing across all servers
about things being unreachable. At the same time, customers on
barbar experienced disk IO problems which caused running processes
to crash.

A few things were established fairly quickly:

- A DDoS was in progress against a customer on barbar.
- The resolvers were unreachable.
- barbar was suffering a possible hardware failure.

The customer under DDoS was blackholed and the DDoS stopped a short
time later.

The resolver failure was tracked down to some sort of split brain
scenario in the cluster, and was patched up by forcing certain nodes
to run.

For the last few hours I have been working on barbar.

barbar is one of our older servers and represents an experiment in
SAS disks instead of hardware RAID+SATA. It's therefore the only
server we have that uses Linux software RAID (mdadm).

mdadm has for some reason failed 3 of the 4 disks in one of the
RAID10 arrays, which is fatal for a RAID10. I do not believe that
there is an actual failure of any of the drives because only one of
the arrays was kicked, not all of them, and I have been able to
reboot it and access data on all the arrays that do work.
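
To illustrate why losing three of four members is fatal for a RAID10, here is a minimal Python sketch. It assumes a four-member md RAID10 in the "near 2" layout, where members 0 and 1 hold the same data and members 2 and 3 hold the same data. The pairing, the member numbers and the function names are illustrative assumptions, not details taken from barbar's configuration.

```python
from itertools import combinations

# Assumed layout: md RAID10, "near 2" copies, four members.
# Members 0 and 1 mirror each other, as do members 2 and 3.
MIRROR_PAIRS = [(0, 1), (2, 3)]


def array_survives(failed):
    """Return True if every mirror pair still has at least one working member."""
    failed = set(failed)
    return all(any(m not in failed for m in pair) for pair in MIRROR_PAIRS)


def surviving_combinations(n_failed, n_members=4):
    """List the failure combinations of a given size that the array survives."""
    return [c for c in combinations(range(n_members), n_failed)
            if array_survives(c)]


if __name__ == "__main__":
    for n in range(1, 5):
        count = len(surviving_combinations(n))
        print(n, "failed:", count, "survivable combinations")
```

Under that assumption some two-member failures are survivable, but any three failed members necessarily include both halves of one mirror pair, so no three-member failure leaves a usable array.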
Unfortunately the array that does not work holds all customer data
for that node as well as /usr and /var for barbar itself.

Customers on barbar should come to terms with the possibility of
complete data loss at this point.

Having said that, I am trying to avoid this. As I say, I believe the
data is largely there. It may be slightly corrupted due to one disk
running for a while without the others. For some people the
corruption may be minor; for others it may be huge.

Having forcibly re-assembled the array we are now at the stage where
I have access to all of the devices, but they may contain damaged
data. I am taking a backup of the usr and var data as it is now,
before I fsck them (fsck -n already says we are in for a bumpy
but probably not catastrophic ride). If I manage to get the actual
server up then I will report on what we will do next.
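
The order of operations described above (forced re-assembly, a read-only check, a backup, and only then a repairing fsck) can be sketched as follows. This is a hedged illustration, not the commands actually run on barbar: the device names, volume paths and backup directory are hypothetical, and the function only builds the command lines without executing anything.

```python
import shlex


def recovery_plan(md_device, members, volumes, backup_dir):
    """Return the commands, in order, as argument lists. Nothing is executed."""
    # Force assembly even though md has marked members as failed.
    plan = [["mdadm", "--assemble", "--force", md_device, *members]]
    for vol in volumes:
        # Read-only check: -n answers "no" to every repair prompt.
        plan.append(["fsck", "-n", vol])
    for vol in volumes:
        # Image the volume as it is now, before anything modifies it.
        image = f"{backup_dir}/{vol.strip('/').replace('/', '_')}.img"
        plan.append(["dd", f"if={vol}", f"of={image}", "bs=1M"])
    for vol in volumes:
        # Repairing fsck comes last, once a backup exists.
        plan.append(["fsck", vol])
    return plan


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    steps = recovery_plan(
        "/dev/md3",
        ["/dev/sda3", "/dev/sdb3", "/dev/sdc3", "/dev/sdd3"],
        ["/dev/vg0/usr", "/dev/vg0/var"],
        "/mnt/backup",
    )
    for cmd in steps:
        print(shlex.join(cmd))
```

Taking the image before the repairing fsck matters because fsck rewrites metadata in place; if it makes the damage worse, the image is the only way back to the current state.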
I do not yet know the root cause of why barbar threw three of its
disks out, but I suspect it may be related to the DDoS. I suppose it
is possible that extreme load made it decide that three of its disks
were not responding and that it was a good idea to kick them, though
I've never seen that before. I won't speculate on that further until
I have more information (having access to /var should help!).

Thanks again for your patience.

Cheers,
Andy

-- 
http://bitfolk.com/ -- No-nonsense VPS hosting
--H1spWtNR+x+ondvy
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature
Content-Disposition: inline
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iEYEAREDAAYFAk/GnucACgkQIJm2TL8VSQuoUgCgmjU9a+jnmN/bj6nr74pOdPWk
c14AoKvTnmJOW77HCPFwq1T8fqXkcWGO
=C7Lp
-----END PGP SIGNATURE-----
--H1spWtNR+x+ondvy--
--===============1596240413==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
_______________________________________________
announce mailing list
announce@???
https://lists.bitfolk.com/mailman/listinfo/announce
--===============1596240413==--
From announce-bounces+users=lists.bitfolk.com@??? Thu May 31 02:48:12 2012
Received: from localhost ([127.0.0.1] helo=bitfolk.com)
by mail.bitfolk.com with esmtp (Exim 4.72) (envelope-from
<announce-bounces+users=lists.bitfolk.com@???>)
id 1SZvQq-00049l-UM
for users@???; Thu, 31 May 2012 02:48:12 +0000
Received: from andy by mail.bitfolk.com with local (Exim 4.72)
(envelope-from <andy@???>) id 1SZvQl-00048g-LT
for announce@???; Thu, 31 May 2012 02:48:08 +0000
Date: Thu, 31 May 2012 02:48:07 +0000
From: Andy Smith <andy@???>
To: announce@???
Message-ID: <20120531024807.GD11695@???>
References: <20120530190217.GB11695@???>
<20120530222751.GC11695@???>
MIME-Version: 1.0
In-Reply-To: <20120530222751.GC11695@???>
OpenPGP: id=BF15490B; url=http://strugglers.net/~andy/pubkey.asc
X-URL: http://strugglers.net/wiki/User:Andy
User-Agent: Mutt/1.5.20 (2009-06-14)
X-Virus-Scanner: Scanned by ClamAV on mail.bitfolk.com at Thu,
31 May 2012 02:48:07 +0000
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
spamd0.lon.bitfolk.com
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-0.0 required=5.0 tests=NO_RELAYS shortcircuit=no
autolearn=disabled version=3.3.1
X-Spam-Report: * -0.0 NO_RELAYS Informational: message was not relayed via SMTP
X-BeenThere: announce@???
X-Mailman-Version: 2.1.13
Precedence: list
Content-Type: multipart/mixed; boundary="===============0668465934=="
Sender: announce-bounces+users=lists.bitfolk.com@???
Errors-To: announce-bounces+users=lists.bitfolk.com@???
X-Virus-Scanner: Scanned by ClamAV on mail.bitfolk.com at Thu,
31 May 2012 02:48:12 +0000
X-SA-Exim-Connect-IP: 127.0.0.1
X-SA-Exim-Mail-From: announce-bounces+users=lists.bitfolk.com@???
X-SA-Exim-Scanned: No (on mail.bitfolk.com); SAEximRunCond expanded to false
Subject: Re: [bitfolk] hardware problems on barbar, 1826Z and ongoing
X-BeenThere: users@???
Reply-To: users@???
List-Id: Users of BitFolk hosting <users.lists.bitfolk.com>
List-Unsubscribe: <https://lists.bitfolk.com/mailman/options/users>,
<mailto:users-request@lists.bitfolk.com?subject=unsubscribe>
List-Archive: <http://lists.bitfolk.com/lurker/list/users.html>
List-Post: <mailto:users@lists.bitfolk.com>
List-Help: <mailto:users-request@lists.bitfolk.com?subject=help>
List-Subscribe: <https://lists.bitfolk.com/mailman/listinfo/users>,
<mailto:users-request@lists.bitfolk.com?subject=subscribe>
X-List-Received-Date: Thu, 31 May 2012 02:48:13 -0000
--===============0668465934==
Content-Type: multipart/signed; micalg=pgp-ripemd160;
protocol="application/pgp-signature"; boundary="/Uq4LBwYP4y1W6pO"
Content-Disposition: inline
--/Uq4LBwYP4y1W6pO
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
On Wed, May 30, 2012 at 10:27:51PM +0000, Andy Smith wrote:
> Having forcibly re-assembled the array we are now at the stage where
> I have access to all of the devices, but they may contain damaged
> data. I am taking a backup of the usr and var data as it is now,
> before I will fsck them (fsck -n already says we are in for a bumpy
> but probably not catastrophic ride). If I manage to get the actual
> server up then I will report on what we will do next.
barbar itself has been up for a little while now. Damage to its /usr
and /var filesystems appears not to have been too severe.

I have also (with permission) managed to fsck two customers'
filesystems and boot their VPSes. One of them appeared to have only
minimal damage (about what you would expect from a hard power
cycle); the other completed an fsck without incident.

At the moment, every other customer on barbar is administratively
locked out of their Xen Shell in order to prevent them from starting
their VPSes and going straight into an fsck, as they may not have
followed all of this and may be unaware of the potential scale of
the problem.
What we're going to do:

For each customer hosted on barbar whose VPS is still down, I will:

- Run an fsck -n on their block devices provided they are ext3.

  IFF that fsck -n returns cleanly:

  - Start customer's VPS

  OTHERWISE:

  - Put a warning message in place in the Xen Shell directing
    customers to the URL of this archived email.

  - Take a backup copy of the customer's block devices

  - Open a support ticket with the customer using the email
    address we have on file for them
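
The per-customer decision above can be sketched in Python as below. It is a rough illustration under stated assumptions: "returns cleanly" is taken to mean an fsck exit status of 0 (fsck documents 0 as "no errors" and 4 as "errors left uncorrected"), a customer with no checked ext3 devices is treated conservatively as not clean, and the function name and step strings are made up for the example.

```python
FSCK_NO_ERRORS = 0  # fsck exit status meaning no errors were found


def next_steps(fsck_statuses):
    """Decide what to do for one customer.

    fsck_statuses holds the exit status of "fsck -n" for each of the
    customer's ext3 block devices (an assumption about how results are kept).
    """
    all_clean = bool(fsck_statuses) and all(
        status == FSCK_NO_ERRORS for status in fsck_statuses
    )
    if all_clean:
        return ["start VPS"]
    # Any error, or nothing that could be checked: take the careful path.
    return [
        "put warning in Xen Shell",
        "back up block devices",
        "open support ticket",
    ]


if __name__ == "__main__":
    print(next_steps([0, 0]))
    print(next_steps([0, 4]))
```

A single unclean device is enough to send the whole VPS down the careful path, which matches the intent of not booting anyone straight into an automatic fsck.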
This support ticket will say something along the lines of:

    Wah wah sky has fallen, etc.¹ As a result your VPS is not
    currently running. When it IS started up, either by you or
    by us, it will very very likely need to have an fsck run and
    proceed to do this during the boot process.

    Many block devices will have corruption and doing an fsck
    could possibly make this worse. Therefore we need you to
    reply to this support ticket