Hi Sam,
On Mon, Nov 23, 2015 at 11:21:54PM +0100, Samuel Bächler wrote:
> I logged into my vps around 21:00 CET today using ssh. When I typed the
> command *ls* it took quite a while (5 to 20 seconds - these things are
> hard to tell when one does not measure it with a clock).
With a delay on the output of "ls" one might first suspect I/O
slowdowns, but looking at the graph for sol:
http://tools.bitfolk.com/cacti/graph_3744.html
(link to zoomable version is on your cacti interface)
…no such issue can really be seen.
I do also know that there were major issues with Level3, one of
Jump's transit providers, starting at around 1949Z.
I had a bunch of queries come in over Twitter¹ around that time as
to why they found BitFolk unreachable. It wasn't totally unreachable
for everyone, just the portion of the planet that was trying to
reach it via Level3, which is still quite a big part. :)
So anyway, I asked Jump, and some time later Jump confirmed that
they had seen Level3 problems and had shut off their port to Level3
at 1959Z in order to force traffic to come by other paths. Some time
later they saw Level3 reboot their side.
> When I tried all the above mentioned things again a few minutes later
> things were back to normal.
> Now, what could that have been?
The problems seen with Level3, and Jump's response to them, would
explain why things recovered a few minutes after 1959Z.
When encountering problems like this it is good to do an mtr (or a
traceroute) and note the path that is taken; if that had shown
Level3's network then we'd have a lot more confidence that that was
the root cause, although it still seems pretty likely.
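
As a rough sketch of what "note the path that is taken" can look like
once a report has been saved, the Python below pulls the hop lines out
of report text and lists the hops whose name contains a given string.
It assumes the text is in mtr's --report layout; the function names
parse_hops and hops_matching are made up for this illustration and are
not part of mtr.

```python
import re

# Matches one hop line of an mtr report, e.g.
# "  4. ae-2.r23.lsanca07.us.bb.gin.ntt.net 35.0% 20 0.7 ..."
# The "%" is optional because fully lossy hops are printed as "100.0".
HOP_RE = re.compile(r"^\s*(\d+)\.\s+(\S+)\s+([\d.]+)%?\s+(\d+)\s")


def parse_hops(report):
    """Return a list of (hop_number, host, loss_percent) tuples."""
    hops = []
    for line in report.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append((int(m.group(1)), m.group(2), float(m.group(3))))
    return hops


def hops_matching(report, needle):
    """Return the hops whose host name contains needle (case-insensitive)."""
    needle = needle.lower()
    return [hop for hop in parse_hops(report) if needle in hop[1].lower()]


if __name__ == "__main__":
    # A short excerpt in report format, used only as sample input.
    sample = """HOST: budvar Loss% Snt Last Avg Best Wrst StDev
      1. 174.136.109.65 0.0% 20 8.1 7.7 0.8 23.8 7.5
      4. ae-2.r23.lsanca07.us.bb.gin.ntt.net 35.0% 20 0.7 1.9 0.6 8.1 2.3
     13. bitfolk.com 0.0% 20 155.0 155.5 154.1 163.0 1.9
    """
    for hop in hops_matching(sample, "ntt.net"):
        print(hop)
```

A plain substring match only works when the host names are complete;
names that have been cut short in the report will not match a full
provider name, so this is a convenience for eyeballing a path rather
than a reliable test.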
Here's what an mtr looked like towards BitFolk during the problems:
HOST: budvar                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. 174.136.109.65                 0.0%    20    0.8   3.1   0.7  13.3   3.9
  2. s7.lax.arpnetworks.com         0.0%    20    0.8   0.6   0.4   1.1   0.2
  3. vlan953.car2.LosAngeles1.Lev   0.0%    20  105.6  54.8   0.4 185.7  66.0
  4. ae-27-27.edge6.LosAngeles1.L   0.0%    20    2.6   1.4   0.4  15.2   3.3
  5. vlan90.csw4.LosAngeles1.Leve   0.0%    20    6.8   0.9   0.4   6.8   1.4
  6. ae-4-90.edge3.LosAngeles1.Le  95.0%    20    0.7   0.7   0.7   0.7   0.0
  7. vlan70.csw2.LosAngeles1.Leve   0.0%    20    0.4   0.6   0.4   1.6   0.3
  8. ae-2-70.edge1.LosAngeles9.Le  95.0%    20    0.6   0.6   0.6   0.6   0.0
  9. vlan70.csw2.LosAngeles1.Leve   0.0%    20    0.5   0.6   0.5   1.0   0.1
 10. ???                           100.0    20    0.0   0.0   0.0   0.0   0.0
 11. vlan80.csw3.LosAngeles1.Leve  10.0%    20    1.7   0.7   0.5   1.7   0.4
 12. ae-3-80.edge5.LosAngeles1.Le   0.0%    20    0.7   0.7   0.5   1.6   0.2
 13. vlan80.csw3.LosAngeles1.Leve  55.0%    20    0.5   0.7   0.5   0.9   0.1
 14. ae-3-80.edge5.LosAngeles1.Le   0.0%    20    0.5   0.9   0.5   4.6   0.9
 15. ???                           100.0    20    0.0   0.0   0.0   0.0   0.0
 16. ae-3-80.edge5.LosAngeles1.Le   0.0%    20    0.7   1.5   0.5  16.1   3.4
 17. vlan80.csw3.LosAngeles1.Leve  95.0%    20    0.5   0.5   0.5   0.5   0.0
 18. ae-3-80.edge5.LosAngeles1.Le  20.0%    20    0.6   0.7   0.5   1.1   0.1
 19. vlan80.csw3.LosAngeles1.Leve  90.0%    20    2.7   1.8   0.8   2.7   1.4
 20. ae-3-80.edge5.LosAngeles1.Le  55.0%    20    2.1   1.0   0.6   2.1   0.5
 21. vlan80.csw3.LosAngeles1.Leve  20.0%    20    1.0   0.9   0.6   2.2   0.4
 22. ae-3-80.edge5.LosAngeles1.Le  85.0%    20    0.7   0.8   0.7   0.9   0.1
 23. vlan80.csw3.LosAngeles1.Leve   0.0%    20    0.8   0.9   0.6   2.8   0.5
 24. ae-3-80.edge5.LosAngeles1.Le  95.0%    20    0.9   0.9   0.9   0.9   0.0
 25. vlan80.csw3.LosAngeles1.Leve   5.0%    20    1.7   0.9   0.6   2.1   0.4
 26. ae-3-80.edge5.LosAngeles1.Le  95.0%    20    0.9   0.9   0.9   0.9   0.0
 27. vlan80.csw3.LosAngeles1.Leve   0.0%    20    0.9   0.9   0.7   2.6   0.4
 28. ???                           100.0    20    0.0   0.0   0.0   0.0   0.0
 29. vlan80.csw3.LosAngeles1.Leve   0.0%    20    0.7   1.5   0.7  12.4   2.7
 30. ???                           100.0    19    0.0   0.0   0.0   0.0   0.0
and after 1959Z:
HOST: budvar                                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. 174.136.109.65                                 0.0%    20    8.1   7.7   0.8  23.8   7.5
  2. s7.lax.arpnetworks.com                         0.0%    20    0.8   6.1   0.4  64.6  15.5
  3. ge-100-0-0-13.r00.lsanca07.us.bb.gin.ntt.net   0.0%    20    0.9   1.2   0.7   3.8   0.8
  4. ae-2.r23.lsanca07.us.bb.gin.ntt.net           35.0%    20    0.7   1.9   0.6   8.1   2.3
  5. ae-6.r22.asbnva02.us.bb.gin.ntt.net            0.0%    20   69.2  73.7  68.9 117.9  14.1
  6. ae-0.r23.asbnva02.us.bb.gin.ntt.net            0.0%    20   67.6  68.0  67.5  70.3   0.7
  7. ae-2.r23.amstnl02.nl.bb.gin.ntt.net            5.0%    20  149.7 148.9 147.3 154.7   2.2
  8. ae-0.r22.amstnl02.nl.bb.gin.ntt.net            0.0%    20  172.5 159.0 154.4 207.4  12.3
  9. ae-5.r23.londen03.uk.bb.gin.ntt.net            0.0%    20  153.4 156.8 153.4 171.7   5.8
 10. ae-7.r00.londen10.uk.bb.gin.ntt.net            0.0%    20  161.2 161.2 160.3 162.1   0.4
 11. vl367-ntt-thn-gw-sup-tfm1.jump.net.uk          0.0%    20  155.1 156.4 154.2 172.6   4.3
 12. jack.bitfolk.com                               0.0%    20  163.2 162.8 162.2 163.6   0.5
 13. bitfolk.com                                    0.0%    20  155.0 155.5 154.1 163.0   1.9
As an aside, these are both bad report formats as they don't show IP
addresses. I include them only because they more obviously show
Level3 and NTT.
If sending an mtr report to someone it is best to use the -n option
to disable DNS lookup, as it is quite common for hops to have
reverse DNS but no matching forward DNS.
e.g. if you have a hop like:
vlanXX.xyz.abc.example.com
that does not resolve to an IP address then I can guess that maybe
example.com is who I need to ask/blame, but it might actually be
the side of a connection that belongs to some other provider. Whereas
if I know the IP address I can more easily associate it with some
company.
Also the person you're sending the report to needs to know the IP
addresses of both ends. The first report doesn't include that; the
second one does at least include the destination of "bitfolk.com"
which will resolve in DNS but it would be nice to skip that step.
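
To make both of those points concrete, here is a minimal sketch in
Python that builds a numeric mtr report command and a one-line header
naming both ends. The helper names mtr_command and report_header are
invented for this example, the 20-cycle default simply mirrors the
reports above, and the addresses are placeholders from the
documentation ranges, not real endpoints.

```python
import shlex


def mtr_command(destination, cycles=20):
    """Build the argument list for an mtr report with DNS lookups disabled.

    --no-dns is the long form of -n, so hops are shown as IP addresses.
    """
    return [
        "mtr",
        "--no-dns",
        "--report",
        "--report-cycles", str(cycles),
        destination,
    ]


def report_header(source_ip, destination_ip):
    """One line telling the reader the IP addresses of both ends."""
    return f"mtr report from {source_ip} to {destination_ip}"


if __name__ == "__main__":
    # Placeholder addresses (assumed for illustration only).
    print(report_header("198.51.100.7", "192.0.2.1"))
    print(shlex.join(mtr_command("192.0.2.1")))
```

Passing the destination as an IP address rather than a name also
saves the reader the DNS step mentioned above. The command is only
printed here; running it is left to whoever is at the keyboard.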
Cheers,
Andy
¹ https://twitter.com/bitfolk
--
http://bitfolk.com/ -- No-nonsense VPS hosting