Locked out of the cloud: Lessons to learn from the MFA outage yesterday
19 Nov 2018: “Summary of impact: Between 04:39 UTC and approximately 21:30 on 19 Nov 2018, customers in Europe, Asia-Pacific, and the American regions may have experienced difficulties signing into Azure resources, such as Azure Active Directory, when Multi-Factor Authentication (MFA) is required by policy.”
Under normal circumstances, MFA is a critical piece of overall security, designed to further protect your environment from anomalous log-in activity and potential brute-force attacks. Given the number of technology services and providers that now rely on MFA as a crucial part of their login process, it’s highly unlikely that you haven’t encountered it in some form. Regardless of the method of verification, the process is fairly universal: after submitting their account credentials, a user is prompted to prove their identity by submitting a verification code (typically sent via email or text). Once this is completed successfully, they gain access to whatever portal, account details, or order information sits behind this protection.
MFA is a really useful feature, one which we enforce through policy at Quadrotech. In our Azure AD environment, MFA is configured to authenticate either against the ‘Authenticator’ app or via text or email. As a result, we were heavily impacted by the outage yesterday, with many of the Quadrotech team unable to access critical services. I personally was impacted by not being able to access Teams via the mobile client.
Outages happen. They’re frustrating, and they often remind both end users and admins that the control they have over their data is brokered through a service that is highly reliable but not immune to technical issues. This article doesn’t aim to criticize anyone over this incident; rather, we want to look at how you can better respond to the challenges that arose here.
So what do you do when access to services is blocked by the very processes that are meant to protect them? You can either disable MFA in a bulk update via the portal or PowerShell, or disable it individually for those impacted (if that impact causes a loss of productivity). Bulk disabling MFA is an option, but it can decrease your overall security unnecessarily, as you may be removing MFA from accounts that are not experiencing the issue. On the other hand, if you disable on an ad-hoc basis, you maintain security on accounts that are not impacted, but you tie up valuable Global Admins’ time disabling MFA at the user level. Tackling the individual problems then becomes a question of who has the greatest need. For me, it did not make sense to disable my MFA because my impact was minimal.
User-level MFA removal like this inevitably pulls your Admins away from other responsibilities and into fire-fighting duties, where everyone claims that their issue is the most important: ‘I’m at a client meeting and can’t get to my presentation on SharePoint’, ‘I’m traveling and can’t get to any of my emails or Teams’; the scenarios go on and on. Once all the more ‘vocal’ users are addressed, how do you assess the overall impact and get ahead of the problem for people who may not have realized it yet? The longer the outage continues, the more unmanageable end-user configuration becomes, especially in large or complex environments.
Can there be a happy medium in a situation like this? One that reduces Admin efforts and stress, without dismantling the security policies and configurations that serve to protect your environment? Sure…with Autopilot.
Autopilot allows you to delegate Office 365 and Azure AD administrative tasks. Enabling or disabling MFA is just one of those. It also allows for the creation of policies so that the system can take action on a selection of accounts. For example, you could create an Autopilot Authorization Policy to enable a non-Global Admin to disable MFA for just the users who are impacted and losing productivity. This could then be delegated to your helpdesk, so that when a ticket is submitted for this issue, MFA can be disabled for that user right then. However, if your entire workforce in Europe is impacted and losing productivity, an Autopilot Configuration Policy could be created for the system to disable MFA on a larger scale based on location. Your Americas operations may not be impacted, so disabling in bulk for just Europe may be the best solution. When the outage is over, the policy can be changed to re-enable MFA for all of Europe.
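To make the location-scoped idea concrete, here is a minimal Python sketch of the logic only. It does not use Autopilot's or Azure AD's actual interfaces; the `User` and `ScopedMfaExemption` names, their fields, and the region labels are all hypothetical, chosen purely for illustration. It shows a temporary exemption applied to one region and then reverted once the outage is over.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical, simplified stand-in for a directory account.
    name: str
    region: str
    mfa_enabled: bool = True


@dataclass
class ScopedMfaExemption:
    """Hypothetical policy: temporarily turn MFA off for one region only."""
    region: str
    exempted: list = field(default_factory=list)

    def apply(self, users):
        # Disable MFA only for users in the target region who currently
        # have it on, and remember exactly whom we touched.
        for user in users:
            if user.region == self.region and user.mfa_enabled:
                user.mfa_enabled = False
                self.exempted.append(user.name)
        return list(self.exempted)

    def revert(self, users):
        # Re-enable MFA only for the accounts this policy changed, so
        # accounts that had MFA off beforehand are left as they were.
        names = set(self.exempted)
        for user in users:
            if user.name in names:
                user.mfa_enabled = True
        self.exempted.clear()


users = [
    User("anna", "Europe"),
    User("ben", "Americas"),
    User("carl", "Europe", mfa_enabled=False),
]
policy = ScopedMfaExemption("Europe")
policy.apply(users)   # during the outage
policy.revert(users)  # once the outage is over
```

Recording which accounts the policy actually changed is the design choice worth noting: it lets the rollback restore the earlier state rather than blindly enabling MFA for everyone in the region.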
Autopilot gives you more options in situations like the one we had yesterday. It allows your organization to make the decision that is best for you while still maintaining your security policies as much as possible in tough situations.