Feb 1 Office 365 Outage Incident Review released

12 Feb 2013 by Emma Robinson

Microsoft have released their incident review for the global outage that affected multiple Microsoft services on Feb 1 2013.  Below is the full review.

Office 365 Customer Ready Post Incident Review

 
Incident Information

Introduction: This Post Incident Review (PIR) consolidates four (4) separate Incident IDs that were posted to the Service Health Dashboard on February 1, 2013. Because the incidents share a common root cause and the same set of next steps, this common PIR will be posted under each of the Incident IDs.
Incident ID: EX2764, SP2765, IS2766, MO2768
Incident Title: Access to Office365 Services impacted for some customers
Service(s) Impacted: Exchange Online, SharePoint Online, Identity and Administrative Services

 
Summary
On February 1, 2013, at 2:03 PM UTC, some Microsoft Office 365 customers across multiple geographic regions started to experience issues accessing the service. The issue was tied to an update made in the Microsoft network that caused incorrect routing for a portion of the inbound internet traffic. Once the root cause was identified, restoration activities began, and customers started seeing improvement at 3:35 PM UTC. Service was restored for all customers except some in Latin America by 3:55 PM UTC, and for all regions by 4:35 PM UTC. A timeline of events follows:
 
2:03 PM UTC – Update to network initiated
2:15 PM UTC – Analysis of alert commences
2:15 PM – 3:15 PM UTC – Underlying root cause tied to network update
3:15 PM – 3:35 PM UTC – Emergency rollback procedure defined and implemented
3:35 PM – 3:55 PM UTC – Service restored to all customers except for some in Latin America
4:35 PM UTC – Services for all regions including Latin America restored
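
To make the length of each phase explicit, the timeline above can be turned into a short calculation. This is only an illustrative sketch: the timestamps are copied from the timeline, while the helper names (`at`, `minutes_between`, `phase_durations`) and the choice to treat the end of each time range as a milestone are assumptions made for this example, not part of Microsoft's review.

```python
from datetime import datetime, timezone


def at(hour, minute):
    """Build a UTC timestamp on February 1, 2013 (the incident date)."""
    return datetime(2013, 2, 1, hour, minute, tzinfo=timezone.utc)


# Milestones taken from the timeline; where the review gives a range,
# the end of the range is used as the milestone (an interpretation).
milestones = [
    ("Update to network initiated", at(14, 3)),
    ("Analysis of alert commences", at(14, 15)),
    ("Root cause tied to network update", at(15, 15)),
    ("Emergency rollback implemented", at(15, 35)),
    ("Service restored except some Latin American customers", at(15, 55)),
    ("Services restored for all regions", at(16, 35)),
]


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


def phase_durations(events):
    """Pair each milestone with the minutes elapsed since the previous one."""
    return [
        (later[0], minutes_between(earlier[1], later[1]))
        for earlier, later in zip(events, events[1:])
    ]


total_minutes = minutes_between(milestones[0][1], milestones[-1][1])

if __name__ == "__main__":
    for label, minutes in phase_durations(milestones):
        print(f"{minutes:>3} min  {label}")
    print(f"{total_minutes:>3} min  total (start to full restoration)")
```

Under this reading, the hour between the start of alert analysis and the identification of the root cause is the longest single phase of the 152-minute incident.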
 
Customer Impact
Customers who were unable to reach the service would have experienced functional loss of Exchange and SharePoint services. They would also have been unable to access administrative services, including the Service Health Dashboard (SHD). Lync Online was unaffected by this issue.
 
Incident Start Date and Time
February 1, 2013, at 2:03 PM UTC
 
Date and Time Service was Restored
February 1, 2013, at 4:35 PM UTC
 
Root Cause
A routine change to the Microsoft Online edge network was applied incorrectly and prevented some internet traffic from reaching the Microsoft Office 365 service. Although the core services were available, access was limited.
 
Next Steps

Issue: Network change had unexpected impact.
Findings: A procedural error was identified in the propagation of an update to the network edge devices, which caused the failure.
Actions: The Standard Operating Procedure has been updated to better test and validate these types of changes.
Team Owner: Microsoft Online Network Operations
Status: Complete

Issue: Network change was automatically propagated to multiple regions.
Findings: The change was deployed in an automated manner, which pushed out the update across multiple devices within the Microsoft Online Network.
Actions: Process update made to ensure initial propagation occurs to a single device and traffic flow is validated prior to automation deploying more broadly.
Team Owner: Microsoft Online Network Operations
Status: Complete

Issue: Provide customers with more accessible alternative to the SHD for service status.
Findings: Although Microsoft provides customers with an alternative to the SHD in the event the primary site is inaccessible, this process did not work as well as expected.
Actions: Identified improvements in the end-to-end Service Health Dashboard solution. Improve the backup notification system, which provides customers an easy alternative for viewing service notifications.
Team Owner: Microsoft O365 Engineering and Operations
Status: May 1, 2013

Issue: Improved monitoring for more rapid response and recovery.
Findings: Because the service event affected internet access and no physical or application components had failed, issue determination and remediation were delayed.
Action Taken: This is an area of continuous improvement to reduce the time for issue determination and remediation.
Team Owner: Microsoft O365 Engineering
Status: Ongoing
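
The second action item above (propagate to a single device, validate traffic flow, then let automation deploy more broadly) describes what is often called a staged or canary rollout. The review does not say how Microsoft implemented it, so the following is only a minimal sketch of the general idea. The function `staged_rollout` and its callback parameters (`apply_change`, `traffic_ok`, `rollback`) are hypothetical names invented for this illustration.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    devices: Iterable[str],
    apply_change: Callable[[str], None],
    traffic_ok: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Apply a change to one device first and widen only if traffic still flows.

    All four parameters are hypothetical hooks supplied by the caller; they
    stand in for whatever tooling actually pushes and validates a change.
    Returns the list of devices that received the change.
    """
    device_list = list(devices)
    if not device_list:
        return []

    canary, remaining = device_list[0], device_list[1:]

    # Step 1: initial propagation to a single device only.
    apply_change(canary)

    # Step 2: validate traffic flow before touching anything else.
    if not traffic_ok(canary):
        rollback(canary)
        raise RuntimeError(f"traffic validation failed on {canary}; rollout halted")

    # Step 3: only now let automation deploy more broadly.
    updated = [canary]
    for device in remaining:
        apply_change(device)
        updated.append(device)
    return updated


if __name__ == "__main__":
    pushed = []
    result = staged_rollout(
        ["edge-1", "edge-2", "edge-3"],  # made-up device names
        apply_change=pushed.append,
        traffic_ok=lambda device: True,
        rollback=lambda device: None,
    )
    print(result)
```

With a gate like this, a change that breaks routing is confined to the first device and rolled back there, rather than being pushed to every device at once.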