[atnog] AMSIX Heute

Rene Avi rene.avi at nextlayer.at
Thu May 21 09:57:53 CEST 2015


Here is the final report on the AMS-IX outage. In short: the 100G ports to the sub-switch had been deprovisioned but were still in the old config (peering VLAN), and the field engineer moved too quickly with a loop, or rather was not stopped by the NOC.

500 out of 600 sessions sounds about right; we saw roughly the same share of dropped sessions there.

Cheers, /Rene

From: Konstantinos Koutalis <konstantinos.koutalis at ams-ix.net>
Date: Wednesday 20 May 2015 23:37
To: Tech-l mail list <tech-l at ams-ix.net>
Subject: Outage post-mortem: 13 May 2015, 100GE loop on AMS-IX ISP Peering LAN.

Dear Members and Customers

As a follow-up on the issue we had last week, on Wednesday, May 13, 2015 at 12:20, we would like to share what happened and what measures we have taken to limit the possibility of a similar event happening in the future.

The following is the sequence of events as we can recall them from different log files:

- 00:00 - 01:00, all customers were moved away from stub-eq3-233,
         as announced in "#174450, Provisioning of new customer ports at EQUINIX-AM3".
         The backbone links on the emptied PE were not disabled.
- 10:39 - 11:00, 100GE modules were replaced on stub-eq3-233
- 11:11 - 11:14, engineers started placing physical loops to
         test the newly installed modules & interfaces. Due to
         a miscommunication between the engineers on site and
         the NOC, the ports to be tested were still in the Peering
         LAN VPLS Instance.
- 12:22,  Interfaces were enabled using a script meant for testing new ports,
         and test traffic was generated.
- 12:25,  As 4 x 100GE ports were looped and still in the Peering LAN
         VPLS and no L2 ACL was in place, broadcast traffic over
         the loop caused customer router MAC addresses to show up
         behind the looped interfaces and attract traffic to
         these MACs.
         Linecards on all switches started having CPU spikes.
         Approx. 500 out of 600 BGP sessions went down on the
         Route Servers.
- 12:29,  NOC disabled the looped interfaces & the backbone links
         of stub-eq3-233.
- 12:40,  Approx. 500 BGP sessions on the route servers were back up
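
The 12:25 step can be illustrated with a toy model. The following is a minimal sketch of a generic MAC-learning bridge, not of the AMS-IX platform or its VPLS implementation; the port names, the `LearningSwitch` class and the MAC addresses (taken from the documentation range) are all assumptions made for illustration. It shows how one broadcast frame from a customer router, once it re-enters through a physical loop, moves that router's MAC address behind the looped ports.

```python
from collections import deque

BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy L2 bridge: remembers the port on which each source MAC was last seen."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}

    def receive(self, src_mac, dst_mac, in_port):
        """Learn the source MAC, then return the ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port
        if dst_mac == BROADCAST or dst_mac not in self.mac_table:
            # Flood to every port except the one the frame arrived on.
            return [p for p in self.ports if p != in_port]
        out_port = self.mac_table[dst_mac]
        return [] if out_port == in_port else [out_port]


def simulate_loop(max_frames=10):
    """Send one broadcast from a customer port into a switch with two looped ports."""
    sw = LearningSwitch(["customer", "loop_a", "loop_b"])
    # Hypothetical loop cable: what leaves loop_a re-enters on loop_b and vice versa.
    cable = {"loop_a": "loop_b", "loop_b": "loop_a"}
    router_mac = "00:00:5e:00:53:01"  # documentation MAC, stands in for a customer router
    queue = deque([(router_mac, BROADCAST, "customer")])
    processed = 0
    # The storm never ends by itself, so stop after max_frames frames.
    while queue and processed < max_frames:
        src, dst, in_port = queue.popleft()
        processed += 1
        for out_port in sw.receive(src, dst, in_port):
            if out_port in cable:
                queue.append((src, dst, cable[out_port]))
    return sw, router_mac, processed


if __name__ == "__main__":
    sw, router_mac, processed = simulate_loop()
    print("frames processed:", processed)
    print("router MAC now learned on port:", sw.mac_table[router_mac])
    # Unicast traffic towards the router is now sent into the loop.
    print("unicast to router goes out of:", sw.receive("00:00:5e:00:53:02", router_mac, "customer"))
```

In this model the frame queue never drains without the `max_frames` cap, which corresponds to the broadcast storm, and the router's MAC entry ends up pointing at a looped port instead of the customer port.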


Analysing the events has shown that there were two flaws in our
procedures that made the outage possible.

1: Blind reliance on the correct functioning of scripts to
  configure the switches into a state where this work could
  be executed.
2: Miscommunication between the on-site engineer and the NOC
  engineer overseeing the state of the platform.

The measures we have taken define stricter communication
between the engineers doing on-site work and the NOC engineer
looking at the state of the platform and the configuration changes
necessary to execute the work. We also defined more clearly who
is responsible for the configuration part while maintenance is
going on.
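
One way to avoid relying blindly on a test script is to have it verify the state it assumes before enabling any port. The following is a minimal sketch under stated assumptions: the `PortState` record, the port names and the instance name `"peering-lan"` are hypothetical, and how the port state would actually be read from the switches is deliberately left out. It is not the procedure AMS-IX adopted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PortState:
    """Hypothetical snapshot of one port's configuration."""
    name: str
    vpls_instance: Optional[str]  # None means the port is in no VPLS instance


def loop_test_blockers(ports, production_instances):
    """Return one message per port that must not be looped or enabled for testing."""
    blockers = []
    for port in ports:
        if port.vpls_instance in production_instances:
            blockers.append(
                f"{port.name}: still a member of production VPLS instance "
                f"{port.vpls_instance!r}"
            )
    return blockers


if __name__ == "__main__":
    ports = [
        PortState("100ge-1/1", "peering-lan"),  # assumed names, for illustration only
        PortState("100ge-1/2", None),
    ]
    problems = loop_test_blockers(ports, production_instances={"peering-lan"})
    if problems:
        print("Refusing to enable test ports:")
        for line in problems:
            print("  " + line)
    else:
        print("All ports are clear for loop testing.")
```

The point of the design is that the check fails closed: the test script only proceeds when the list of blockers is empty, rather than assuming that an earlier step removed the ports from the peering LAN.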

During the incident, the AMS-IX NOC, being focused on ensuring that the stability of the platform was restored,
neglected to send a timely notification to Tech-L informing all technical contacts about the outage.
After reviewing our internal procedures we have updated them to ensure that, in any similar incident,
the NOC engineer working with our on-site engineers will immediately notify all AMS-IX peering parties.

Once again, we sincerely apologize for any inconvenience caused by that outage.

Kind regards,

Kostas (Konstantinos) Koutalis

NOC Manager
Amsterdam Internet Exchange (AMS-IX)


From: Jürgen Jaritsch <jj at anexia.at>
Date: Wednesday 13 May 2015 21:12
To: Christoph Loibl <c at tix.at>, Jürgen Jaritsch <jj at anexia.at>, "klaus.darilion at nic.at" <klaus.darilion at nic.at>
Cc: "atnog at mailing.atnog.at" <atnog at mailing.atnog.at>
Subject: Re: [atnog] AMSIX Heute

Hi all,

Doesn't AMS-IX, just like France-IX, rely on a VPLS implementation? STP doesn't get you far there.

Regardless of that: for tests like these you put the port into L3 mode and that's it ...

Best regards


Jürgen Jaritsch
Head of Network & Infrastructure

ANEXIA Internetdienstleistungs GmbH

Phone: +43-5-0556-300
Fax: +43-5-0556-500

Email: jj at anexia.at
Web: http://www.anexia.at

Headquarters address Klagenfurt: Feldkirchnerstraße 140, 9020 Klagenfurt
Managing Director: Alexander Windbichler
Company register: FN 289918a | Place of jurisdiction: Klagenfurt | VAT ID: AT U63216601


-----Original Message-----
From: Klaus Darilion [klaus.darilion at nic.at]
Received: Wednesday, 13 May 2015, 14:05
To: Christoph Loibl [c at tix.at]; Jürgen Jaritsch [jj at anexia.at]
CC: atnog at mailing.atnog.at
Subject: Re: [atnog] AMSIX Heute


On 13.05.2015 at 16:12, Christoph Loibl wrote:
> while testing one of the newly installed 100GE modules, accidentally placed a loop on the ISP peering VLAN

Shouldn't STP prevent something like this? On the Cisco and Juniper switches
I know, STP cannot even be disabled, for example. Or do exchanges use
different switches without STP?

Regards
Klaus