Hi Dom,

On Fri, Nov 28, 2014 at 11:45:47PM +0000, Dom Latter wrote:
> One small suggestion though. As an affected customer, I didn't
> submit a support ticket as I figured it was almost certainly
> being dealt with, and me sticking my oar in would just cause
> unnecessary work.

You were right to not send a support ticket because I wouldn't have
been able to answer it immediately, and then afterwards all I would
have done would have been to point to the announce@ post, assuming
that your service was functioning by that point.

Should a customer experience total loss of service then I suppose
that ideally I'd like their troubleshooting process to be something
like this:

                                  ┌──────────────────────┐
                                  │        Start         │
                                  └──────────────────────┘
                                    │
                                    │
                                    ▼
      ┌──────────────────┐        ┌──────────────────────┐
      │                  │        │  Can I reach things  │
      │ No need to send  │        │   outside BitFolk?   │
      │ a support ticket │   No   │  Local problems do   │
┌───▶ │                  │ ◀───── │      happen :)       │
│     └──────────────────┘        └──────────────────────┘
│       ▲                           │
│  Yes  │                           │ Yes
│       │                           ▼
│       │                         ┌──────────────────────┐
│       │                         │   Can I reach the    │
└───────┼───────────────────┐     │      Xen Shell?      │ ─┐
        │                   │     └──────────────────────┘  │
        │                   │       │                       │
        │                   │       │ Yes                   │
        │                   │       ▼                       │
        │                   │     ┌──────────────────────┐  │
        │                   │     │   Is it a problem    │  │
        │                   │     │    related to my     │  │
        │                   │     │    VPS that I can    │  │
        │                   └──── │    solve myself?     │  │
        │                         └──────────────────────┘  │
        │                           │                       │
        │                           │ No                    │ No
        │                           ▼                       │
        │                         ┌──────────────────────┐  │
        │                         │  Is this explained   │  │
        │  Yes                    │ by an outage posting │  │
        └──────────────────────── │    on announce@?     │  │
                                  └──────────────────────┘  │
                                    │                       │
                                    │ No                    │
                                    ▼                       │
                                  ┌──────────────────────┐  │
                                  │    Send a support    │  │
                                  │        ticket        │ ◀┘
                                  └──────────────────────┘
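
For anyone who finds code easier to follow than boxes, the same
decision process could be sketched roughly like this in Python. The
function and parameter names are invented purely for illustration;
nothing like this exists in BitFolk's tooling, and each answer is
something the customer works out for themselves.

```python
def should_send_ticket(reach_outside, reach_xen_shell,
                       can_fix_myself, on_announce):
    """Return True if the flowchart ends at "Send a support ticket".

    All four arguments are booleans answered by the customer; the
    names are hypothetical and only mirror the questions in the chart.
    """
    if not reach_outside:
        # Can't reach anything outside BitFolk either: likely local.
        return False
    if not reach_xen_shell:
        # The Xen Shell is unreachable: send a ticket.
        return True
    if can_fix_myself:
        # A problem with my own VPS that I can solve myself.
        return False
    if on_announce:
        # Already explained by an outage posting on announce@.
        return False
    return True
```

So a customer who can reach the outside world and the Xen Shell,
can't fix the problem themselves, and sees nothing on announce@ ends
up at "send a support ticket"; every other path that passes the Xen
Shell check ends at "no need".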

> I didn't seem to be able to find anything on bitfolk.com telling
> me about it. Perhaps it's all Twitter these days? I don't know,
> I don't do Twitter.
> So perhaps a simple server status page on the website?

You are not the first person to have asked about this.

Here's what I tend to do when there's some sort of widespread
outage:

┌──────────────────────────┐
│ Verify there's a problem │
└──────────────────────────┘
  │
  ∨
┌──────────────────────────┐
│       Write on IRC       │
└──────────────────────────┘
  │
  ∨
┌──────────────────────────┐
│     Write on twitter     │
└──────────────────────────┘
  │
  ∨
┌──────────────────────────┐
│   More extensive look    │
└──────────────────────────┘
  │
  ∨
┌──────────────────────────┐
│      Write an email      │
└──────────────────────────┘
  │
  ∨
┌──────────────────────────┐
│    Deal with problem     │
└──────────────────────────┘
  │
  ∨
┌──────────────────────────┐
│  Answer support tickets  │
└──────────────────────────┘

IRC because I'm always on IRC. (#bitfolk on irc.bitfolk.com)

Twitter because a lot of people use Twitter and it's pretty simple
to basically copy whatever I said on IRC to there. (@bitfolk)

Email tends to require slightly more thought, so I prefer to
investigate things for a few minutes first. Hopefully there is some
action I can take which might take a few minutes to complete and in
that break I can dash off an email.

So, I appreciate that people who don't use IRC/Twitter have a short
period of time where they might be confused about an apparent outage
but can see nothing on BitFolk's web site. This does result in some
support tickets which are ignored for a short time until an email
has gone out that I can link them to.

There was an abortive attempt at some sort of status page, but I
experienced severe aversion to updating it and it fell by the
wayside.

Enough people want this that I really should do it, so let's talk
about that in a separate thread on this list at more length.

[1] Pillock In Server Room

Despite having some choice words regarding the event I do want to
make clear that this sort of thing happens to us all from time to
time. :) So I don't want to take it out on the person that was doing
the work, and that would be the case whether it was an employee of
mine or (as in this case) an employee of a supplier.

As I say, we are not happy with the socket design on those PDUs,
which leaves all of the cables decidedly wiggly at that end, and the
rack is also very full, making any work a more delicate matter.
This will be improved.

A question I was asked is whether that machine had dual PSUs. It
doesn't, and that would of course have avoided the issue. The racks
do have two PDUs on different power feeds. So why not?

This was a business decision, and I do still think it was the right
decision. We have seen very few problems with power since 2007. In
fact I think we have seen about as many problems with an entire rack
or suite as with individual power, and in those cases the dual feeds
wouldn't have helped.

Firstly, kwak is one of the oldest machines still in service, a 1U
server, and dual PSUs were not so common then in 1U form factor.
Even so it could have been done.

Newer machines do in fact have two PSUs but I only have one of them
hooked up.

The thing is, the cost for having dual power supplies would be at
least 26% higher on a recurring basis, which isn't something I
wanted to swallow or pass on. Instead I'd rather that customers
build in their own resilience and that I work to reduce downtime if
it should occur.

In the event of a dead PSU, I have spare PSUs – in the same host in
the case of newer hardware. In the event of dead PDUs, the colo
provider has spare PDUs. In the event of complete server death I
have spare hardware. There will obviously be downtime associated
with swapping this stuff out, but it is at least bounded.

I appreciate that on the very next power issue these words will haunt
me, but those are the facts: it was a business decision because I
can't absorb or pass on a 26% increase in colo fees.

Even so, it is possible this may change with the next round of
hardware upgrades because I hope to put larger numbers of customers
on fewer pieces of hardware. In that case a simple mistake like this
won't just affect 38 customers as it did with kwak.

Cheers,
Andy
--
http://bitfolk.com/ -- No-nonsense VPS hosting