2014-11-26 ~1432Z: Problem with kwak.bitfolk.com, and IPv6 (all hosts)


Andy Smith

26 Nov 2014

3:27 p.m.

Hi,

Around 1432Z IPv6 connectivity to all hosts was lost, and VPSes on kwak.bitfolk.com became unreachable (both IPv4 and v6).

Subsequent investigation has revealed that kwak.bitfolk.com was unexpectedly power cycled and returned in a configuration that had no networking.

IPv6 connectivity was restored at around 1503Z and VPSes hosted on kwak.bitfolk.com are now in the process of being booted again.

If you are unable to reach your VPS, and it is hosted on kwak.bitfolk.com¹, please log in to your Xen Shell and look at its console to see what is happening:

    https://tools.bitfolk.com/wiki/Xen_Shell

There is a high possibility that the VPS is still booting, is performing a filesystem check, or has failed to boot because of some configuration problem local to your VPS.

If you have ruled all of those out then please do send a support ticket to support(a)bitfolk.com.

For those of you with Nagios monitoring set up I will be watching to make sure any alerts recover where that is within my power.

To follow:

- How kwak came to be power cycled
- Why it didn't boot with networking enabled
- Why IPv6 broke for everyone even though it should have failed over to another router.

Cheers,
Andy

¹ If you don't know, you can find out which piece of hardware your VPS is hosted on as follows:
  https://bitfolk.com/customer_information.html#toc_3_Which_piece_of_actual_h…

-- 
http://bitfolk.com/ -- No-nonsense VPS hosting

Please consider the environment before reading this e-mail.
    -- John Levine
_______________________________________________
announce mailing list
announce(a)lists.bitfolk.com
https://lists.bitfolk.com/mailman/listinfo/announce


Andy Smith

27 Nov 2014

9:25 a.m.

Hello,

On Wed, Nov 26, 2014 at 03:27:10PM +0000, Andy Smith wrote:
> - How kwak came to be power cycled

Someone from our colo provider was working in the rack (on other hardware) at the time of the power interruption and most likely knocked the power distribution unit (PDU) end of the cable, causing a momentary power loss. They were unaware that they had done this, but the power loss/restore happened at the exact time they were working in the rack.

I've had a chat with them about this, and we've come to the conclusion that neither of us is happy with the PDUs in that rack, as the PDU end sockets are all rather loose and it's too easy for this to occur. While kwak.bitfolk.com was out of service we went through the cabling and made sure it was secure.

They have already ordered new PDUs which support locking cables, i.e., you plug them in and then they don't come out no matter how hard you knock them because they have a catch on them. These should be available in January and new hardware will use these; I'll then have to make a decision about whether to have that machine re-cabled (will involve scheduled maintenance if so).

> - Why it didn't boot with networking enabled

When kwak booted after its power cycle it seemed to be in a state where neither of its network interfaces was up. This was highly confusing and in fact I initially thought that perhaps both network cables had been unplugged.

This was actually due to a configuration error on my part.

kwak.bitfolk.com is one of the oldest servers we have. It had actually been up for several years when it was power cycled.

All of our servers have bonded networking. That is, there are two network interfaces, each of which is cabled to a separate switch in the rack. Should the network port, cable or switch die then the machine should be able to continue using the other path.

In order to do this we use the standard Linux "bonding" driver that takes over the two network interfaces eth0 and eth1 and creates a new one, bond0, that is used instead. That's the one that is mentioned in the host's networking configuration.

kwak was not configured correctly and the "bonding" kernel module was not loaded on boot. So, bond0 did not exist and networking was not brought up. eth0 and eth1 had a link but hadn't been set as "up" because they are not mentioned in configuration.

When kwak was originally installed in the rack it didn't actually have resilient networking (i.e. it didn't use bonding, it just used eth0). It was switched to the bonded configuration on the fly later on without a reboot, and while it obviously had a working active configuration, that configuration had never actually been tested from a boot, and the boot-time part had not been done correctly.

The main problem this actually caused was one of confusion on my part, and time was wasted wondering if there was a problem with the cabling and/or switches, to the point of actually asking someone to trace the cables. It was only when they said, "I can see link lights" that I even thought to check whether the interfaces had carrier (were plugged in and seeing the switch on the other end) but were just not configured.

So, obviously I've corrected that configuration and also the configuration on the one other server that was in the same situation.

More to the point I feel this highlights a need for an improvement in process so that information about individual servers, e.g. "this server has never had its bonded networking tested from a clean boot", is retained somewhere other than in my own head, where it is prone to bit rot over a period of years.
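
For readers who want to tell "cables unplugged" apart from "bond never set up" on a freshly booted Linux host, here is a minimal Python sketch of the checks described above. It is an illustration only, not BitFolk's tooling: the function names are invented for this example, and it assumes the bonding driver is built as a loadable module (so it would show up in /proc/modules) and that the bond is bond0 with eth0 and eth1 as members, as in the message. It inspects the running state only; it does not validate the on-disk boot configuration.

```python
from pathlib import Path

IFF_UP = 0x1  # administrative "up" bit in /sys/class/net/<iface>/flags


def parse_loaded_modules(proc_modules_text):
    """Return the module names found in /proc/modules-style text."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def is_admin_up(flags_text):
    """True if a hex flags string such as '0x1003' has the IFF_UP bit set."""
    return bool(int(flags_text.strip(), 16) & IFF_UP)


def bonding_problems(modules, interfaces, members, bond="bond0", expected=("eth0", "eth1")):
    """List the reasons the bond would not be carrying traffic.

    modules:    set of loaded kernel module names
    interfaces: set of network interface names that exist
    members:    set of interfaces currently enslaved to the bond
    """
    problems = []
    if "bonding" not in modules:
        problems.append("bonding module not loaded")
    if bond not in interfaces:
        problems.append(f"{bond} does not exist")
    for nic in expected:
        if nic not in members:
            problems.append(f"{nic} is not a member of {bond}")
    return problems


def read_live_state(bond="bond0"):
    """Collect the inputs for bonding_problems() from a running Linux host."""
    modules = parse_loaded_modules(Path("/proc/modules").read_text())
    # Entries in /sys/class/net are links to directories; skip plain files
    # such as bonding_masters.
    interfaces = {p.name for p in Path("/sys/class/net").iterdir() if p.is_dir()}
    members_file = Path("/sys/class/net") / bond / "bonding" / "slaves"
    members = set(members_file.read_text().split()) if members_file.exists() else set()
    return modules, interfaces, members
```

On a host in the state described above, `bonding_problems(*read_live_state())` would be expected to report the missing module, the missing bond0 and both unattached interfaces, which points at configuration rather than cabling.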

> - Why IPv6 broke for everyone even though it should have failed over to another router.

This bit I am still looking into.

Cheers,
Andy

-- 
http://bitfolk.com/ -- No-nonsense VPS hosting

"I am the permanent milk monitor of all hobbies!"
    -- Simon Quinlank

Dom Latter

28 Nov 2014

11:45 p.m.

On 27/11/14 09:25, Andy Smith wrote:
> Hello,
>
> On Wed, Nov 26, 2014 at 03:27:10PM +0000, Andy Smith wrote:
> > - How kwak came to be power cycled

Ah, a PISR [1] event.

I'll say it again: for People Like Me, this sort of transparent and honest communication with customers, affected or not, is a *far* more effective Marketing Tool than some sort of bogus "100% SLA" [2].

One small suggestion though. As an affected customer, I didn't submit a support ticket as I figured it was almost certainly being dealt with, and me sticking my oar in would just cause unnecessary work.

OTOH I didn't seem to be able to find anything on bitfolk.com telling me about it. Perhaps it's all Twitter these days? I don't know, I don't do Twitter. So perhaps a simple server status page on the website?

The nearest I could find was the traffic reports that showed zero bytes in or out for kwak, so I could figure out that it wasn't just me, then.

[1] Pillock In Server Room
[2] Always utterly worthless, AFAICWO, unless you are into Seriously Expensive territory.

Sämi Bächler

29 Nov 2014

7:47 a.m.

On 29/11/14 00:45, Dom Latter wrote:
> On 27/11/14 09:25, Andy Smith wrote:
> > Hello,
> >
> > On Wed, Nov 26, 2014 at 03:27:10PM +0000, Andy Smith wrote:
> > > - How kwak came to be power cycled

ed neville

8:56 a.m.

On Fri, Nov 28, 2014 at 11:45:47PM +0000, Dom Latter wrote:
> Perhaps it's all Twitter these days? I don't know, I don't do Twitter. So perhaps a simple server status page on the website?

I don't do Twitter either, but imagine the PISR had done the power to a core switch: the status page may be unavailable too. Bad as that may be, I still won't do Twitter.

On the bright side, at least those with systemd had a quicker boot. Yeah, it's a troll.

-- 
Best regards,
Ed
http://www.s5h.net/

john lewis

9:05 a.m.

On Fri, 28 Nov 2014 23:45:47 +0000 Dom Latter <bitfolk-users(a)latter.org> wrote:
> On 27/11/14 09:25, Andy Smith wrote:
> > Hello,
> >
> > On Wed, Nov 26, 2014 at 03:27:10PM +0000, Andy Smith wrote:
> > > - How kwak came to be power cycled

I didn't know there had been a problem until Andy told us. My VPS must have rebooted without any problem, as I didn't get a warning message from Nagios.

What is Twitter? ;-)

-- 
John Lewis
Debian & the GeneWeb genealogical data server

Andy Smith

2:52 p.m.

Hi John,

On Sat, Nov 29, 2014 at 09:05:38AM +0000, john lewis wrote:
> I didn't get a warning message from Nagios.

I've checked and a "host down" email was definitely sent to you on the 26th by Nagios. It would have come from "nagios(a)bitfolk.com". If you didn't receive it then we really should sort that out over at support(a)bitfolk.com.

Cheers,
Andy

-- 
http://bitfolk.com/ -- No-nonsense VPS hosting

john lewis

4:20 p.m.

On Sat, 29 Nov 2014 14:52:51 +0000 Andy Smith <andy(a)bitfolk.com> wrote:
> Hi John,
>
> On Sat, Nov 29, 2014 at 09:05:38AM +0000, john lewis wrote:
> > I didn't get a warning message from Nagios.

Hi Andy,

I don't seem to have had anything from Nagios since last July.

John

-- 
John Lewis
Debian & the GeneWeb genealogical data server

Ian

1:58 p.m.

Dom Latter said:
> I'll say it again: for People Like Me, this sort of transparent and honest communication with customers, affected or not, is a *far* more effective Marketing Tool than some sort of bogus "100% SLA" [2].

Yes. From an email to Andy over five years ago:

"Incidentally, one of the things that made me go, 'yes, this is the right decision' is the readable archive of the mailing list. ... Here, I can see that stuff happens occasionally, but there's an openness about it."

Ian

Andy Smith

2:39 p.m.

Hi Dom,

On Fri, Nov 28, 2014 at 11:45:47PM +0000, Dom Latter wrote:
> One small suggestion though. As an affected customer, I didn't submit a support ticket as I figured it was almost certainly being dealt with, and me sticking my oar in would just cause unnecessary work.

You were right to not send a support ticket because I wouldn't have been able to answer it immediately, and then afterwards all I would have done would have been to point to the announce@ post, assuming that your service was functioning by that point.

Should a customer experience total loss of service then I suppose that ideally I'd like their troubleshooting process to be something like this:

                           ┌──────────────────────┐
                           │        Start         │
                           └──────────────────────┘
                                      │
                                      ▼
┌───────────────────┐      ┌──────────────────────┐
│ No need to send   │      │ Can I reach things   │
│ a support ticket  │  No  │ outside BitFolk?     │
│ Local problems do │◀─────│                      │
│ happen :)         │      └──────────────────────┘
└───────────────────┘                 │ Yes
     ▲         ▲                      ▼
     │         │           ┌──────────────────────┐
     │         │           │ Can I reach the      │  No
     │         │           │ Xen Shell?           │─────┐
     │         │           └──────────────────────┘     │
     │         │                      │ Yes             │
     │         │                      ▼                 │
     │         │           ┌──────────────────────┐     │
     │         │           │ Is it a problem      │     │
     │         │    Yes    │ related to my VPS    │     │
     │         └───────────│ that I can solve     │     │
     │                     │ myself?              │     │
     │                     └──────────────────────┘     │
     │                                │ No              │
     │                                ▼                 │
     │                     ┌──────────────────────┐     │
     │         Yes         │ Is this explained by │     │
     └─────────────────────│ an outage posting on │     │
                           │ announce@?           │     │
                           └──────────────────────┘     │
                                      │ No              │
                                      ▼                 │
                           ┌──────────────────────┐     │
                           │ Send a support       │◀────┘
                           │ ticket               │
                           └──────────────────────┘
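
The flow chart above can also be read as a small decision function. The following Python sketch is only an illustration of that reading; the function and argument names are hypothetical and simply stand for the four questions in the boxes.

```python
def should_send_support_ticket(
    can_reach_outside_bitfolk,
    can_reach_xen_shell,
    can_solve_myself,
    explained_on_announce,
):
    """Return True if the flow chart ends at "Send a support ticket"."""
    if not can_reach_outside_bitfolk:
        # Nothing outside BitFolk is reachable either: a local problem.
        return False
    if not can_reach_xen_shell:
        # The Xen Shell itself is unreachable: go straight to a ticket.
        return True
    if can_solve_myself:
        # VPS-local problem (still booting, fsck, own configuration).
        return False
    if explained_on_announce:
        # An outage posting on announce@ already covers it.
        return False
    return True


# The kwak outage as seen by an affected customer after the announce@ post:
# the Xen Shell is reachable, the problem is not theirs to fix, but it is
# explained by the posting, so no ticket is needed.
print(should_send_support_ticket(True, True, False, True))
```

The order of the checks mirrors the order of the boxes, so the later answers only matter once the earlier questions have been answered "yes" or "no" as the chart requires.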

> I didn't seem to be able to find anything on bitfolk.com telling me about it. Perhaps it's all Twitter these days? I don't know, I don't do Twitter. So perhaps a simple server status page on the website?

You are not the first person to have asked about this. Here's what I tend to do when there's some sort of widespread outage:

┌──────────────────────────┐
│ Verify there's a problem │
└──────────────────────────┘
             │
             ∨
┌──────────────────────────┐
│ Write on IRC             │
└──────────────────────────┘
             │
             ∨
┌──────────────────────────┐
│ Write on Twitter         │
└──────────────────────────┘
             │
             ∨
┌──────────────────────────┐
│ More extensive look      │
└──────────────────────────┘
             │
             ∨
┌──────────────────────────┐
│ Write an email           │
└──────────────────────────┘
             │
             ∨
┌──────────────────────────┐
│ Deal with problem        │
└──────────────────────────┘
             │
             ∨
┌──────────────────────────┐
│ Answer support tickets   │
└──────────────────────────┘

IRC because I'm always on IRC. (#bitfolk on irc.bitfolk.com)

Twitter because a lot of people use Twitter and it's pretty simple to basically copy whatever I said on IRC to there. (@bitfolk)

Email tends to require slightly more thought, so I prefer to investigate things for a few minutes first. Hopefully there is some action I can take which might take a few minutes to complete and in that break I can dash off an email.

So, I appreciate that people who don't use IRC/Twitter have a short period of time where they might be confused about an apparent outage but can see nothing on BitFolk's web site. This does result in some support tickets which are ignored for a short time until an email has gone out that I can link them to.

There was an abortive attempt at some sort of status page, but I experienced severe aversion to updating it and it fell by the wayside. Enough people want this that I really should do it, so let's talk about that in a separate thread on this list at more length.

> [1] Pillock In Server Room

Despite having some choice words regarding the event I do want to make clear that this sort of thing happens to us all from time to time. :) So I don't want to take it out on the person that was doing the work, and that would be the case whether it was an employee of mine or (as in this case) an employee of a supplier.

As I say, we are not happy with the socket design on those PDUs, which leaves all of the cables decidedly wiggly at that end, and the rack is also very full, making any work a more delicate matter. This will be improved.

A question I was asked is whether that machine had dual PSUs. It doesn't, and that would of course have avoided the issue. The racks do have two PDUs on different power feeds.

So why not? This was a business decision, and I do still think it was the right decision. We have seen very few problems with power since 2007. In fact I think we have seen about as many problems with an entire rack or suite as with power to an individual machine, and in those cases the dual feeds wouldn't have helped.

Firstly, kwak is one of the oldest machines still in service, a 1U server, and dual PSUs were not so common then in 1U form factor. Even so it could have been done. Newer machines do in fact have two PSUs but I only have one of them hooked up.

The thing is, the cost for having dual power supplies would be at least 26% higher on a recurring basis, which isn't something I wanted to swallow nor pass on. Instead I'd rather that customers build in their own resilience and that I work to reduce downtime if it should occur.

In the event of a dead PSU, I have spare PSUs (in the same host in the case of newer hardware). In the event of dead PDUs, the colo provider has spare PDUs. In the event of complete server death I have spare hardware. There will obviously be downtime associated with swapping this stuff out, but it is at least bounded.

I appreciate that on the very next power issue these words will haunt me, but those are the facts: it was a business decision because I can't absorb or pass on a 26% increase in colo fees.

Even so, it is possible this may change with the next round of hardware upgrades because I hope to put larger numbers of customers on fewer pieces of hardware. In that case a simple mistake like this would affect far more than the 38 customers it did with kwak.

Cheers,
Andy

-- 
http://bitfolk.com/ -- No-nonsense VPS hosting
