Microsoft’s latest cloud authentication outage: What went wrong


Microsoft has released a preliminary analysis of the cause of the Azure Active Directory outage, which took down Office, Teams, Dynamics 365, Xbox Live, and other Microsoft and third-party applications that rely on Azure AD for authentication. The 14-hour outage affected a “portion” of Microsoft customers worldwide, officials said.

Microsoft’s preliminary analysis of the incident, published on March 16, indicated that an error occurred in the rotation of keys used to support Azure AD’s use of OpenID and other identity standard protocols for cryptographic signing operations, according to the findings published on the Azure Status History page.

Officials said that, as part of normal security practice, an automated system removes keys that are no longer in use. Over the past few weeks, however, one key had been marked as ‘retained’ for longer than normal to support a complex migration between clouds. This exposed a bug that caused the retained key to be removed anyway. Microsoft publishes metadata about the signing keys globally. Once that metadata changed around 3:00 PM ET (the start of the outage), applications using these protocols with Azure AD began picking up the new metadata and stopped trusting tokens and assertions signed with the removed key.
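The failure mode described above can be illustrated with a minimal sketch. It is not Azure AD's implementation: the key IDs, the `sign_token`, `publish_metadata`, and `validate` helpers, and the use of HMAC in place of the asymmetric signatures that OpenID deployments normally use are all assumptions made for illustration. It shows only that a token signed with a key that has disappeared from the published metadata stops validating, even though the token itself has not changed.

```python
import hashlib
import hmac

# Hypothetical signing keys held by the identity provider, keyed by key ID.
# Real deployments use asymmetric key pairs; HMAC is a simplified stand-in.
PROVIDER_KEYS = {"key-old": b"secret-old", "key-new": b"secret-new"}


def sign_token(payload: str, kid: str) -> dict:
    """Issue a toy token: the payload, the signing key's ID, and a signature."""
    sig = hmac.new(PROVIDER_KEYS[kid], payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "kid": kid, "sig": sig}


def publish_metadata(active_kids) -> dict:
    """Publish verification material for the keys the provider currently lists."""
    return {kid: PROVIDER_KEYS[kid] for kid in active_kids}


def validate(token: dict, metadata: dict) -> bool:
    """Trust a token only if its key ID is published and the signature matches."""
    key = metadata.get(token["kid"])
    if key is None:
        return False  # the signing key is no longer published, so reject
    expected = hmac.new(key, token["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["sig"])


# A token signed with a key that is still in use.
token = sign_token("user=alice", "key-old")

metadata_before = publish_metadata(["key-old", "key-new"])
metadata_after = publish_metadata(["key-new"])  # the in-use key was removed

accepted_before = validate(token, metadata_before)
accepted_after = validate(token, metadata_after)
print(accepted_before, accepted_after)
```

In this sketch the same token is accepted against the first metadata set and rejected against the second, which mirrors how applications stopped trusting otherwise valid tokens once they picked up the changed metadata.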

Microsoft engineers rolled the system back to its previous state around 5:00 PM ET, but it took a while for applications to pick up the rolled-back metadata and refresh with the correct metadata. A subset of storage resources required an update to invalidate the incorrect entries and force a refresh.
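The lag between the rollback and full recovery is consistent with clients caching metadata. The following sketch assumes a simple time-to-live cache; the `MetadataCache` class, the one-hour TTL, and the key IDs are hypothetical and are not taken from Microsoft's report. It shows why a client can keep serving incorrect metadata after the source has been fixed, and how an explicit invalidation forces a refresh.

```python
import time


class MetadataCache:
    """Caches published key metadata for `ttl` seconds before re-fetching."""

    def __init__(self, fetch, ttl, clock=time.monotonic):
        self._fetch = fetch          # callable returning the current metadata
        self._ttl = ttl              # assumed cache lifetime in seconds
        self._clock = clock          # injectable clock, to make the demo deterministic
        self._value = None
        self._expires_at = None      # None means "nothing cached"

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self):
        """Drop the cached entry so the next get() re-fetches."""
        self._expires_at = None


# Hypothetical published metadata, initially missing the in-use key.
published = {"kids": ["key-new"]}
fake_now = [0.0]
cache = MetadataCache(lambda: dict(published), ttl=3600, clock=lambda: fake_now[0])

first = cache.get()                           # the client caches the bad metadata
published["kids"] = ["key-old", "key-new"]    # the provider rolls back
fake_now[0] = 600.0                           # ten minutes later, still within the TTL
still_stale = cache.get()                     # the client still sees the bad metadata

cache.invalidate()                            # force a refresh
fresh = cache.get()                           # the client now sees the rolled-back metadata
```

Without the `invalidate()` call, the client in this sketch would keep the incorrect entry until its TTL expired, which is one plausible reason recovery trailed the rollback.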

Microsoft’s report explains that Azure AD is undergoing a multi-phase effort to apply additional protections to the back-end Safe Deployment Process to prevent such issues. The key removal component is in the second phase of that effort, which is not due to be completed until mid-year. Microsoft officials said the Azure AD authentication outage that occurred in late September was part of the same class of risks, which they believe will be avoided once the multi-phase project is completed.

“We understand how incredibly impactful and unacceptable it is and sincerely apologize. We are constantly taking steps to improve the Microsoft Azure platform and our processes to ensure that such incidents do not occur in the future,” the blog post said.

A full root cause analysis will be published once the investigation is complete, officials said.
