Hi,
For just short of 5 hours on Friday 29 April we lost connectivity to
80.0.0.0/8 i.e. every IP address that starts with 80.
An email sent at the time:
On Fri, Apr 29, 2016 at 10:58:07AM +0000, Andy Smith wrote:
So the list of affected prefixes so far is:
80.1.0.0/16
80.7.0.0/16
80.40.0.0/13
80.68.80.0/20
80.95.96.0/19
80.192.0.0/14
80.229.0.0/16
80.238.0.0/21
212.110.160.0/19
The last one now appears to have been listed in error and was
actually unaffected. The extent of the outage was of course all IP
destinations in 80/8 and not just the ones reported above.
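For anyone who wants to check a prefix themselves, here is a minimal
sketch using Python's standard ipaddress module. It tests each prefix
reported above against 80.0.0.0/8; the names AFFECTED_RANGE and
inside_affected_range are only illustrative.

```python
import ipaddress

# The range that lost connectivity, as stated above.
AFFECTED_RANGE = ipaddress.ip_network("80.0.0.0/8")

# Prefixes reported at the time of the outage.
reported = [
    "80.1.0.0/16",
    "80.7.0.0/16",
    "80.40.0.0/13",
    "80.68.80.0/20",
    "80.95.96.0/19",
    "80.192.0.0/14",
    "80.229.0.0/16",
    "80.238.0.0/21",
    "212.110.160.0/19",
]

def inside_affected_range(prefix: str) -> bool:
    """True if the whole prefix lies within 80.0.0.0/8."""
    return ipaddress.ip_network(prefix).subnet_of(AFFECTED_RANGE)

# Anything outside 80.0.0.0/8 cannot have been hit by this outage.
listed_in_error = [p for p in reported if not inside_affected_range(p)]
print(listed_in_error)
```

Only 212.110.160.0/19 falls outside the range, which matches the
correction above.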
This outage was the result of route filtering by our transit
provider Jump Networks. A more detailed explanation of how that
happened has been provided by Jump and is included below.
Jump's route filtering is controversial and has in the past caused
us (relatively minor) problems when a network removes a covering
route. It's something I've been putting pressure on Jump to improve
so I'm glad to see further development being put into it.
I am sometimes asked why we are single-homed to Jump and do not do our
own BGP. I do not see it as being single-homed, I see it as being
part of Jump's network. If we were talking about being on the end of
a wire that was connected solely to Jump then that would be a
single-homed situation that I wouldn't find reasonable. However our
connections to Jump's core network are redundant at every level, and
from there Jump has multiple transit providers and a pair of LINX
peering sessions.
It does mean however that we have to own the problems that happen with
Jump's network when they affect us and not just try to pass them off as
being "a fault with a supplier". I apologise for this outage and any
disruption that it caused you.
On the whole I am happy with Jump's network and don't think I could
do a better job in replicating it. I suspect that if I tried then
what we'd end up with is something that overall has less
reliability. The route filtering matter is the only thing I have
taken issue with; I will continue to push for improvements in
this area to reduce the possibility of any further problems of this
nature.
I appreciate there are some fairly technical details in this and the
explanation below so if any of it is unclear then do feel free to
ask further questions.
Regards,
Andy Smith
Director
BitFolk Ltd
--
http://bitfolk.com/ -- No-nonsense VPS hosting
Please consider the environment before reading this e-mail.
— John Levine
From: "James A. T. Rice"
Date: Fri, 29 Apr 2016 23:42:48 +0100
Subject: [jump-announce] Connectivity loss from Jump to 80.0.0.0/8 earlier today
Connectivity loss from Jump to 80.0.0.0/8
=========================================
What happened
-------------
On 2016-04-29 from 0611 UTC until 1053 UTC we suffered a loss of
connectivity to the IPv4 80.0.0.0/8 range. IPv6 was unaffected.
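As a quick check of the outage length, the two times above can be
subtracted directly. This is only an arithmetic sketch; the variable
names are illustrative.

```python
from datetime import datetime, timezone

# Times taken from the report above (both UTC).
lost = datetime(2016, 4, 29, 6, 11, tzinfo=timezone.utc)
restored = datetime(2016, 4, 29, 10, 53, tzinfo=timezone.utc)

outage = restored - lost
hours, remainder = divmod(int(outage.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours}h {minutes}m")  # 4h 42m
```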
The set of circumstances leading to this was:
1) On 2016-04-27 at 1318 UTC, Belgacom, AS6774 (Belgium's largest
telecommunications company) started announcing 80.0.0.0/8 into the global
BGP routing tables. Belgacom have no authority to announce 80.0.0.0/8.
2) Despite Belgacom having no authority to announce 80.0.0.0/8, our
upstreams, NTT, Level3, and GlobalCrossing, trusted these announcements
from Belgacom, accepting these routes into their routing tables.
Our upstreams normally have fairly strict route filters, however Belgacom
appears to have been one of their 'trusted' peers that these filters
weren't applied to.
3) Jump in turn has trusted NTT/Level3/GlobalCrossing's route filters to
give us an authoritative view of what's supposed to be in the global
routing tables.
In this case the trust was too much, and we accepted 80.0.0.0/8, and with
our subsequent filter rebuilds stopped accepting de-aggregates of that
prefix.
We do have sanity checking in place in our filter building to help prevent
illegitimate aggregates like this reducing the size of our accepted prefix
list too much, however there were only ~1500 prefixes announced inside
80.0.0.0/8 and this reduction wasn't beyond the sanity check limits we
have.
4) On 2016-04-29 at 0611 UTC, Belgacom stopped announcing the illegitimate
80.0.0.0/8 prefix. Since Jump was no longer accepting the de-aggregates
within that range, we lost connectivity to 80.0.0.0/8.
We would have hoped to be alerted right away by our nlnog ring node,
which monitors when other nlnog ring nodes become unreachable. However,
out of 389 nodes only 6 were in 80.0.0.0/8, which wasn't sufficient to
trigger a problem report.
5) At 1053 UTC a manually triggered filter rebuild at Jump restored
connectivity to prefixes within 80.0.0.0/8.
NB Filter rebuilds normally happen three times a day, outside of working
hours, at 0030 UTC, 0730 UTC, and 1830 UTC.
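The sequence above can be illustrated with a small simulation. This is
a deliberately simplified sketch, not Jump's actual filter builder: it
assumes the allow list is an exact-match set of prefixes and that a
rebuild drops any prefix covered by a shorter announced prefix. The
function names build_allow_list and reachable are hypothetical.

```python
import ipaddress

def build_allow_list(announced):
    """Assumed rebuild rule: drop prefixes covered by a shorter announced one."""
    nets = [ipaddress.ip_network(p) for p in announced]
    return [
        n for n in nets
        if not any(n != other and n.subnet_of(other) for other in nets)
    ]

def reachable(address, announced_now, allow_list):
    """Reachable if a currently announced, allowed prefix contains the address."""
    addr = ipaddress.ip_address(address)
    accepted = [
        n for n in map(ipaddress.ip_network, announced_now) if n in allow_list
    ]
    return any(addr in n for n in accepted)

deaggregates = ["80.68.80.0/20", "80.229.0.0/16", "212.110.160.0/19"]
with_aggregate = deaggregates + ["80.0.0.0/8"]

# Rebuild while the illegitimate /8 is announced: its de-aggregates drop out.
allow = build_allow_list(with_aggregate)
during_announcement = reachable("80.68.80.1", with_aggregate, allow)

# The /8 is withdrawn but the allow list has not been rebuilt yet.
after_withdrawal = reachable("80.68.80.1", deaggregates, allow)
outside_range = reachable("212.110.160.1", deaggregates, allow)

# A rebuild without the /8 accepts the de-aggregates again.
allow = build_allow_list(deaggregates)
after_rebuild = reachable("80.68.80.1", deaggregates, allow)

print(during_announcement, after_withdrawal, outside_range, after_rebuild)
```

Under these assumptions the destination in 80.0.0.0/8 stays reachable
while the aggregate is announced, becomes unreachable once it is
withdrawn, and returns after a rebuild, while the prefix outside the
range is unaffected throughout.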
Prevention
----------
1) Belgacom shouldn't misconfigure their BGP
Unfortunately we can't control this.
2) NTT/Level3/GlobalCrossing shouldn't trust Belgacom so much
Unfortunately we can't control this.
3) We shouldn't trust NTT/Level3/GlobalCrossing so much
It is suboptimal that illegitimate routing announcements and
misconfigurations at other ISPs can cause trouble for us. This is one of
the reasons we have the route filter system: it almost eliminates the
risk of us accepting BGP prefix hijacks, which typically involve
announcing a more specific prefix of a legitimate BGP route. However,
we're also keen to prevent the problems that we have currently 'traded'
for the ones the prefix filtering prevents.
We do have a developer working on the BGP route filtering system, and one
of the milestones already set is for 'genuine / necessary' routes to
persist in the allow list for 9 months after they were last necessary, which
will prevent situations like this. This will be in place in the next few
weeks.
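As a rough illustration of the retention idea, the sketch below keeps a
prefix in the allow list while its last-needed date is within the
retention window. It is not the planned implementation: the data
structure, the function name retained_prefixes, and the approximation
of 9 months as 270 days are all assumptions made for the example.

```python
from datetime import date, timedelta

# Assumption: "9 months" approximated as 270 days.
RETENTION = timedelta(days=270)

def retained_prefixes(last_needed, today):
    """Return prefixes whose last-needed date is within the retention window."""
    return sorted(
        prefix for prefix, seen in last_needed.items()
        if today - seen <= RETENTION
    )

last_needed = {
    # Last needed on the day the covering /8 first appeared.
    "80.68.80.0/20": date(2016, 4, 27),
    # Hypothetical prefix that has not been needed for well over 9 months.
    "192.0.2.0/24": date(2015, 1, 1),
}

kept = retained_prefixes(last_needed, date(2016, 4, 29))
print(kept)
```

With this rule the de-aggregate would still have been allowed on
2016-04-29, two days after the aggregate appeared.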
I hope this explains the problems. My apologies for any inconvenience
this caused.
Sincerely
James Rice
Director
Jump Networks Ltd