Hi Joseph,
On Tue, Jun 26, 2012 at 11:14:43AM +0100, Joseph Heenan wrote:
> here's the graph for my VPS:
> http://f8lure.mouselike.org/archived_graphs/button.heenan.me.uk_day25.png
> The number of huge spikes (and some packet loss, shown red) on this
> surprised me. Would this kind of result be expected?
About a month ago I was made aware of a problem with occasional
spikes of high latency, and on looking into it, it became apparent
that it had actually been the case for a long time - perhaps years -
without anyone really noticing.
What you're seeing is one or two packets out of every couple of
hundred being delayed somewhere, sometimes for hundreds of
milliseconds.
It isn't restricted to your VPS, or to any one BitFolk server. It
seems to be affecting all VPSes, but as I say, it has been doing so
for a very long time. Here's a graph that exemplifies the issue:
http://www.thinkbroadband.com/ping/share/9b7cf0ba2197b53c0aeb0f3cff42fb7e.h…
Since then I've been trying to work out where it's actually
happening, and this has been a long and ongoing process.
Firstly, it *is* restricted to BitFolk. Other things hosted at the
same colo are not seeing it. Here's something else in the same rack
as some BitFolk nodes, connected to the same switches:
http://www.thinkbroadband.com/ping/share/cc1418a68757c0f78c674ca6cd0beabe.h…
That led me to wonder if it could be some form of overloading of
BitFolk's VM hosting nodes. I feel like I have by this point
discounted that possibility though, because I have been emptying off
the node "curacao" to the point where it now has just two VPSes left
on it, one of which is the "pingtest" VPS above, which still shows
the issue. So it's hard to believe that it can be overloading.
Then I wondered about proxy ARP. I worked with our colo provider to
restrict the number of IP addresses that their routers would ARP
for, and we examined packet traces for ARP activity, but that proved
to be fruitless.
So next, is it a problem inherent to Xen? Well, the "penguin" graph
above is a Xen-based VPS running on hardware similar to BitFolk's,
which was set up by me in a virtually-identical way to how I set up
BitFolk's nodes, and it doesn't show the problem.
That's where we are at the moment, and I'm continuing to work on
this. By tomorrow I'll have moved the last remaining customer off
curacao and then I'll move that node into a different VLAN with
other (non-BitFolk) servers that aren't currently experiencing this
problem, to see what happens.
I'm afraid I can't give you any ETA on when this might be fixed as I
still don't know exactly what the problem is. I will keep you
informed of progress.
Cheers,
Andy
--
http://bitfolk.com/ -- No-nonsense VPS hosting