In early May 2019, Microsoft suffered an outage that left many customers unable to connect to Office 365 or (some) Azure services. At the root of the issue was a faulty DNS configuration, which surfaced during an effort to move DNS services in-house at Microsoft.
Whilst I’m sure that Microsoft took all necessary precautions to avoid issues, I found one thing in the Post-Incident Report (PIR) very interesting: apparently, Microsoft did not pick up on the outage until after customers reported it. My guess is that this is because they take a heavily inside-focused approach to monitoring.
Indeed, from an “inside” perspective, Office 365 was working just fine. Externally, however, not so much. As the PIR states, Microsoft will review its monitoring configuration as a result of this outage.
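To make the inside/outside distinction a bit more tangible, here is a minimal sketch of what an outside-in check could look like: a script that runs from a machine outside your own network and simply asks whether your public hostnames still resolve. This is my own illustration, not what Microsoft uses or plans to use. The function names (resolves, failing_hosts) and the injectable resolver parameter are assumptions I made for the example.

```python
import socket
from typing import Callable, Iterable, List


def resolves(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves to at least one address.

    The resolver parameter is a hypothetical hook so the check can be
    exercised without real DNS; by default it uses the system resolver.
    """
    try:
        # Port 443 is only passed because getaddrinfo expects a port.
        return len(resolver(hostname, 443)) > 0
    except OSError:
        # socket.gaierror (name resolution failure) is a subclass of OSError.
        return False


def failing_hosts(hostnames: Iterable[str],
                  resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the hostnames that do not resolve from this vantage point."""
    return [host for host in hostnames if not resolves(host, resolver)]
```

Run on a schedule from an external location, failing_hosts(["outlook.office365.com"]) would return an empty list while everything is healthy and the affected names when resolution breaks. A real external monitor would go further (sign-in, mail flow, multiple locations), but even a check this small looks at the service the way a customer does.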
So, as you can see, every outage has its lessons! Should you want to learn more, head over to the ENow blog, where I’ve written a little more on the topic.