11 Dec 2019 by Mike Weaver
Microsoft have released their incident review for the global outage that affected multiple Microsoft services on February 1, 2013. The full review follows.
Office 365 Customer Ready Post Incident Review
|Introduction||This Post Incident Review (PIR) consolidates four (4) separate Incident IDs that were posted to the Service Health Dashboard on February 1, 2013. Because each incident had a common root cause and the same set of next steps, this common PIR will be posted under each of the Incident IDs.|
|Incident ID||EX2764, SP2765, IS2766, MO2768|
|Incident Title||Access to Office 365 Services impacted for some customers|
|Service(s) Impacted||Exchange Online, SharePoint Online, Identity and Administrative Services|
On February 1, 2013, at 2:03 PM UTC, some Microsoft Office 365 customers across multiple geographic regions started to experience issues accessing the service. The issue was tied to an update made in the Microsoft network, which caused incorrect routing for a portion of the inbound internet traffic. Once the root cause was identified, restoration activities began, and customers started seeing improvement at 3:35 PM UTC. Service was restored at 3:55 PM UTC for all customers except some in Latin America; full resolution was achieved by 4:35 PM UTC. A timeline of events follows:
2:03 PM UTC – Update to network initiated
2:15 PM UTC – Analysis of alert commences
2:15 PM – 3:15 PM UTC – Underlying root cause tied to network update
3:15 PM – 3:35 PM UTC – Emergency rollback procedure defined and implemented
3:35 PM – 3:55 PM UTC – Service restored to all customers except for some in Latin America
4:35 PM UTC – Services for all regions including Latin America restored
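The intervals in this timeline can be worked out directly. The following is a minimal sketch, not part of Microsoft's review: the event labels and helper names are illustrative, and the end of each stated time range is assumed to mark the point at which that phase completed.

```python
from datetime import datetime, timezone


def utc(hour, minute):
    """Build a UTC timestamp on the incident date, February 1, 2013."""
    return datetime(2013, 2, 1, hour, minute, tzinfo=timezone.utc)


# Times come from the timeline above; the labels are paraphrased.
timeline = [
    ("update initiated", utc(14, 3)),
    ("analysis of alert commences", utc(14, 15)),
    ("root cause tied to network update", utc(15, 15)),
    ("rollback implemented", utc(15, 35)),
    ("restored except some Latin America", utc(15, 55)),
    ("restored for all regions", utc(16, 35)),
]


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


def phase_durations(events):
    """Pair each event with the next one and report the gap in minutes."""
    return [
        (f"{name_a} -> {name_b}", minutes_between(time_a, time_b))
        for (name_a, time_a), (name_b, time_b) in zip(events, events[1:])
    ]


total_minutes = minutes_between(timeline[0][1], timeline[-1][1])

for label, minutes in phase_durations(timeline):
    print(f"{minutes:>3} min  {label}")
print(f"{total_minutes:>3} min  total, first change to full restoration")
```

On these assumptions, the hour spent tying the alert to the network update is the longest single phase of the 152-minute incident.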
Customers who were unable to reach the service would have experienced functional loss of Exchange and SharePoint services. Customers would have also been unable to access administrative services including the Service Health Dashboard (SHD). Lync Online was unaffected by this issue.
Incident Start Date and Time
February 1, 2013, at 2:03 PM UTC
Date and Time Service was Restored
February 1, 2013, at 4:35 PM UTC
A routine change to the Microsoft Online edge network was applied incorrectly, which prevented some internet traffic from reaching the Microsoft Office 365 service. Although the core services were available, access was limited.
Next Steps/Findings/Action Taken
|Network change had unexpected impact.||Findings: A procedural error was identified in the propagation of an update to the network edge devices which caused the failure. Actions: The Standard Operating Procedure has been updated to better test and validate these types of changes.||Microsoft Online Network Operations||
|Network change was automatically propagated to multiple regions.||Findings: The change was deployed in an automated manner which pushed out the update across multiple devices within the Microsoft Online Network. Actions: Process update made to ensure initial propagation occurs to a single device and traffic flow is validated prior to automation deploying more broadly.||Microsoft Online Network Operations||
|Provide customers with more accessible alternative to the SHD for service status.||Findings: Although Microsoft provides customers with an alternative to the SHD in the event the primary site is inaccessible, this process did not work as well as expected. Actions: Identified improvements in the end-to-end Service Health Dashboard solution. Improve the backup notification system which provides customers an easy alternative for viewing service notifications.||Microsoft O365 Engineering and Operations||
May 1, 2013
|Improved monitoring for more rapid response and recovery.||Findings: Because the service event involved internet access and no physical or application components had failed, issue determination and remediation were delayed. Action Taken: This is an area of continuous improvement to reduce the time for issue determination and remediation.||Microsoft O365 Engineering||
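The process change described for automated propagation (apply to a single device, validate traffic flow, then deploy more broadly) can be sketched as follows. This is an illustration only and does not represent Microsoft's tooling: `staged_rollout`, `apply_change`, `traffic_ok`, `rollback`, and the device names are all hypothetical, and validating each later device in turn is an added assumption beyond what the review states.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    devices: Iterable[str],
    apply_change: Callable[[str], None],
    traffic_ok: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Apply a change to one device first, then to the rest only if traffic validates.

    Returns the devices left with the change applied (empty if rolled back).
    """
    pending = list(devices)
    if not pending:
        return []
    canary, rest = pending[0], pending[1:]

    # Step 1: touch a single device and check traffic before going further.
    apply_change(canary)
    if not traffic_ok(canary):
        rollback(canary)
        return []

    # Step 2: continue to the remaining devices, validating after each one.
    updated = [canary]
    for device in rest:
        apply_change(device)
        if not traffic_ok(device):
            # Stop at the first failure and undo everything applied so far.
            for done in reversed(updated + [device]):
                rollback(done)
            return []
        updated.append(device)
    return updated


if __name__ == "__main__":
    applied, rolled_back = [], []
    result = staged_rollout(
        ["edge-1", "edge-2", "edge-3"],
        apply_change=applied.append,
        traffic_ok=lambda device: False,  # simulate a bad change
        rollback=rolled_back.append,
    )
    print(result, applied, rolled_back)
```

With a change that fails validation, only the first device is ever touched, so the other devices keep routing traffic normally.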