Hi,
Here follows the outage report for the networking issue that occurred
on the morning of Friday 20th. If you have any questions please
feel free to ask either on or off-list; I'll answer if I can or pass
them on to James if I can't.
I understand that James has now completed his reorganisation of how
the terminal servers and masterswitches are accessed.
Cheers,
Andy
----- Forwarded message from "James A. T. Rice - Jump" -----
Hi Folks,
Apologies for the outage this morning. I'll attempt to describe what's
been found, and what is being done to help prevent it happening again and
to recover more quickly from anything similar in future.
It looks like the first signs of partial connectivity instability appeared
at around 0500Z, and at around 0630Z the instability suddenly became quite
severe.
The sup-tfm1 router had disabled CEF due to a malloc failure, but was still
mostly forwarding traffic and manageable. The sup-tfm4 router, however, had
become very unstable: BGP sessions were flapping constantly, CPU was
saturated, and management of the device was impossible.
VRRP would normally take care of failover from a dead device; however,
sup-tfm4 was still announcing itself as the preferred gateway, despite
being in no state to do so reliably.
The terminal servers, which speak BGP to the border routers, were mostly
accessible, as BGP to the mostly dead router was failing to fully
establish.
The reload of sup-tfm4 was completed at about 0840Z; sup-tfm1 (generally
the backup router) was upgraded to a newer IOS and reloaded at 1000Z.
To help prevent this happening again, the following has been done:
* Upgrade IOS on sup-tfm1 (removes a few memory leaks, fragmentation problems, etc.).
* Reduce the number of full BGP tables carried from 5 to 3 on each router.
A plan to upgrade the IOS on sup-tfm4 will follow in due course.
To help make resolution quicker should a similar situation occur, I'm
currently revamping the management devices (masterswitches, terminal
servers, etc.) so that the terminal servers do not depend on the routers
to reach the masterswitches. I'm also sanity-checking all the
configurations and improving them where possible.
Again, apologies for the inconvenience caused. I'm sorry we hadn't taken
action to streamline the management network earlier; it has been on the
to-do list. Please feel free to rant etc. on jump-discuss, or to me, or
to support.
Thanks
James
----- End forwarded message -----
--
http://bitfolk.com/ -- No-nonsense VPS hosting
Encrypted mail welcome - keyid 0x604DE5DB