Microsoft Outage Caused by Overloaded Azure DNS Servers


Microsoft has revealed that Thursday’s global outage was caused by a code defect that allowed the Azure DNS service to become overwhelmed, leaving it unable to respond to DNS queries.

At about 5:21 p.m. EST on Thursday, Microsoft experienced a global outage that prevented users from accessing or signing in to numerous services, including Xbox Live, Microsoft Office, SharePoint Online, Microsoft Intune, Dynamics 365, Microsoft Teams, Skype, Exchange Online, OneDrive, Yammer, Power BI, Power Apps, OneNote, Microsoft Managed Desktop, and Microsoft Streams.

The outage was so widespread within Microsoft’s infrastructure that even the Azure status page, which is used to provide outage information, was inaccessible.

Azure status page unreachable
Source: Twitter

Microsoft eventually resolved the outage at around 6:30 p.m. EST, although some services took a little longer to function properly again.

At the time, Microsoft said the outage was caused by a DNS issue but did not provide further information.

Azure DNS service overloaded

Last night, Microsoft published a root cause analysis (RCA) for this week’s outage, explaining that it was caused by an overload of its Azure DNS service.

Microsoft’s Azure DNS is a global network of redundant name servers that provides high availability and fast DNS services.

According to Microsoft, the Azure DNS service began receiving an ‘anomalous surge’ of DNS queries from around the world targeting a set of domains hosted on Azure. Although Microsoft did not explain what this surge was, it could have been a DDoS attack targeting those domains.

Microsoft states that its DNS service can normally handle a large number of requests through layers of caches and traffic shaping. However, a code defect prevented its DNS Edge caches from working efficiently.
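To illustrate why cache efficiency matters so much under a query surge, here is a minimal sketch of a TTL cache placed in front of a backend resolver. It is not Microsoft’s implementation: the class `TtlCache`, the helper `count_backend_queries`, the address `192.0.2.1`, the query rate, and the TTL values are all hypothetical, chosen only to show how many queries reach the backend when a cache serves hits versus when it does not.

```python
class TtlCache:
    """Toy DNS-style cache: answers are reused until their TTL expires."""

    def __init__(self, resolve, ttl_ms):
        self._resolve = resolve      # backend lookup function (hypothetical)
        self._ttl_ms = ttl_ms
        self._entries = {}           # name -> (answer, expiry time in ms)
        self.backend_queries = 0     # how many lookups reached the backend

    def lookup(self, name, now_ms):
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now_ms:
            return entry[0]          # cache hit: backend is not contacted
        self.backend_queries += 1    # cache miss: backend does the work
        answer = self._resolve(name)
        self._entries[name] = (answer, now_ms + self._ttl_ms)
        return answer


def count_backend_queries(ttl_ms):
    """Send 1,000 lookups for one name over 10 seconds; count backend hits."""
    cache = TtlCache(lambda name: "192.0.2.1", ttl_ms)
    for now_ms in range(0, 10_000, 10):   # one query every 10 ms
        cache.lookup("www.example.com", now_ms)
    return cache.backend_queries


print(count_backend_queries(ttl_ms=5_000))  # → 2
print(count_backend_queries(ttl_ms=0))      # → 1000
```

In this toy model, a working cache turns 1,000 client queries into 2 backend queries, while a cache that never serves a hit passes all 1,000 through. A defect that lowers the hit rate therefore multiplies the load on the servers behind the cache.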

“Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure’s layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches.”

“As our DNS service became overloaded, DNS clients began frequent retries of their requests, which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service,” Microsoft explained in the RCA for this week’s outage.
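The retry feedback loop described in the RCA can be sketched with a toy simulation. This is a deliberately simplified model, not a description of Azure DNS: the function `simulate_offered_load`, the fixed per-tick capacity, and the numbers used are assumptions made for illustration. Each tick, the server answers up to `capacity` queries, and every unanswered query is retried once in the next tick on top of the normal demand.

```python
def simulate_offered_load(base_load, capacity, ticks, retry=True):
    """Return the number of queries offered to the server in each tick.

    base_load: new queries arriving per tick (hypothetical figure)
    capacity:  queries the server can answer per tick (hypothetical figure)
    retry:     whether unanswered queries are re-sent in the next tick
    """
    offered = []
    unanswered = 0  # queries dropped in the previous tick
    for _ in range(ticks):
        load = base_load + (unanswered if retry else 0)
        offered.append(load)
        unanswered = max(0, load - capacity)
    return offered


# A surge only 20% above capacity keeps growing once clients retry.
print(simulate_offered_load(120, 100, 5))               # → [120, 140, 160, 180, 200]
print(simulate_offered_load(120, 100, 5, retry=False))  # → [120, 120, 120, 120, 120]
```

Under these assumptions the offered load climbs every tick even though the underlying demand is constant, which is the pattern the RCA describes: retries look like legitimate traffic, so they add to the overload instead of being filtered out.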

Because almost all Microsoft domains are resolved through Azure DNS, it was no longer possible to resolve hostnames on these domains, or to access the services behind them, once the DNS service was overloaded.

For example, the domain xboxlive.com uses the following Azure DNS name servers to resolve hostnames on this domain:

NS1-205.AZURE-DNS.COM
NS2-205.AZURE-DNS.NET
NS3-205.AZURE-DNS.ORG
NS4-205.AZURE-DNS.INFO

Because xboxlive.com is hosted on Azure DNS, and that service became unavailable, users could no longer sign in to Xbox Live.
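The dependency described above can be checked mechanically once a domain’s name servers are known. The sketch below works on a hard-coded list rather than performing a live lookup, and the suffix list in `AZURE_DNS_SUFFIXES` is an assumption inferred from the four name servers quoted in this article, not an official list. The function names are hypothetical.

```python
# Suffixes inferred from the name servers listed in the article (assumption).
AZURE_DNS_SUFFIXES = (
    ".azure-dns.com",
    ".azure-dns.net",
    ".azure-dns.org",
    ".azure-dns.info",
)


def is_azure_dns(name_server):
    """Return True if the name server host falls under an Azure DNS suffix."""
    host = name_server.rstrip(".").lower()
    return host.endswith(AZURE_DNS_SUFFIXES)


def depends_only_on_azure_dns(name_servers):
    """Return True if every listed name server belongs to Azure DNS."""
    servers = list(name_servers)
    return bool(servers) and all(is_azure_dns(ns) for ns in servers)


XBOXLIVE_NS = [
    "NS1-205.AZURE-DNS.COM",
    "NS2-205.AZURE-DNS.NET",
    "NS3-205.AZURE-DNS.ORG",
    "NS4-205.AZURE-DNS.INFO",
]

print(depends_only_on_azure_dns(XBOXLIVE_NS))  # → True
```

A result of True means the domain has no name server outside Azure DNS in the given list, so an outage of that one service leaves nothing else to answer queries for it.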

To prevent this type of outage in the future, Microsoft states that it is fixing the code defect in Azure DNS so that the DNS cache can adequately handle large volumes of requests. It also plans to improve the monitoring and mitigation of anomalous traffic.

BleepingComputer contacted Microsoft to find out more about this anomalous surge but has not heard back yet.
