Managing Concentration Risk and Service Layer Outages in the Cloud: Lessons from the AWS Outage  - Node4
Managing Concentration Risk & Service Layer Outages in the Cloud: Lessons from the AWS Outage

When the Cloud Goes Down: Why It Matters 

Earlier this week, millions of users and businesses across the globe were reminded just how much we rely on the cloud. A major AWS outage in the US-EAST-1 region didn’t just take down a few services – it disrupted everything from HMRC and airport check-in systems to Slack, Snapchat, and Fortnite. The story made mainstream news in the UK, underscoring just how far-reaching these incidents can be.  

But here’s the uncomfortable truth: these outages aren’t rare, and they’re not unique to AWS. Over the past few years, every major cloud provider – Microsoft, Google, AWS, and others – has suffered significant service disruptions. The question isn’t if another outage will happen, but when. 

Recent Cloud Failures & Outages 

Let’s take a quick look at some of the most impactful outages from the last few years:

2025

Company | Cause
AWS | DNS resolution failure in US-EAST-1 disrupted DynamoDB, affecting Snapchat, Fortnite, Coinbase, Canva, Zoom, Slack, and more.
Microsoft 365 | Service degradation due to backend configuration issues impacted Teams, Outlook, and SharePoint.
Google Cloud | IAM and networking issues caused outages in Gmail, YouTube, and GCP APIs.

2024

Company | Cause
Google Cloud Frankfurt | A significant power failure caused downtime across European clients.
CrowdStrike | A faulty update disabled millions of Windows machines globally.
Azure Central US | Configuration failure impacted VMs, Cosmos DB, and Microsoft 365.

2023

Company | Cause
AWS US-EAST-1 | Regional outage affected Slack, Zoom, and banking apps.
Azure | DNS and routing issues disrupted Microsoft 365 and Azure DevOps.

2022

Company | Cause
Cloudflare CDN | DNS and edge server failure disrupted Google, Amazon, Facebook, and more.
Azure | Authentication and access control issues caused global service degradation.

2021

Company | Cause
AWS | Control plane overload in US-EAST-1 disrupted EC2, Lambda, and EventBridge.
Azure | Global DNS outage affected Microsoft 365, Teams, and DevOps.
Akamai | DNS misconfiguration caused global web outages for FedEx, Steam, and PlayStation Network.

Why Are Cloud Outages So Disruptive? 

Public cloud platforms are a mesh of interdependent services that are largely abstracted away from customers. If one component has an issue, such as DNS resolution for DynamoDB, it can bring down a whole cascade of other services, as happened in the AWS outage. In fact, two of the biggest culprits of cloud failures are DNS and centralised management planes (e.g. authentication). Outages of these types can be global and can affect many services at once.

The real kicker? Many cloud providers use their own platforms to run their operational systems. So, a failure in one area can quickly cascade, making recovery slow and complex. 
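
To make the cascade concrete, here is a minimal sketch in Python. The dependency map is entirely hypothetical (the service names are illustrative and do not describe any provider's real architecture); the point is only that a failure in one low-level component reaches everything that depends on it, directly or indirectly.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it depends on.
# Illustrative names only, not a real provider's architecture.
DEPENDS_ON = {
    "dns": [],
    "database": ["dns"],
    "auth": ["database"],
    "compute": ["auth", "dns"],
    "queue": ["database"],
    "object_storage": [],
    "web_app": ["compute", "queue"],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk outwards from the failed component.
    impacted = set()
    to_visit = deque([failed])
    while to_visit:
        current = to_visit.popleft()
        for service in dependents[current]:
            if service not in impacted:
                impacted.add(service)
                to_visit.append(service)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by("dns", DEPENDS_ON)))
```

In this toy map, a DNS failure reaches five of the other six services, while a failure in the standalone storage service reaches none. Running the same walk over your own dependency map is a quick way to spot components with an outsized blast radius.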

Understanding Concentration Risk 

Concentration risk is what happens when an organisation (or even an entire sector) relies too heavily on a single cloud provider or service. In the UK, the public cloud market is dominated by just three players, making it tough to diversify and limit exposure. The outage record above shows that failures are not only likely to recur but can also be global, impacting your core infrastructure, SaaS providers, and supply chain all at once.
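
One rough way to put a number on your own exposure is to look at how your critical workloads are spread across providers. The sketch below is an assumption-laden illustration, not a formal risk methodology: the inventory, the provider names, and the 50% threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical inventory: (workload, provider, is_critical).
WORKLOADS = [
    ("payments", "provider_a", True),
    ("customer_portal", "provider_a", True),
    ("identity", "provider_a", True),
    ("analytics", "provider_b", False),
    ("backups", "private_cloud", True),
]


def critical_share_by_provider(workloads):
    """Return each provider's share of critical workloads (0.0 to 1.0)."""
    counts = Counter(provider for _, provider, critical in workloads if critical)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {provider: count / total for provider, count in counts.items()}


def over_threshold(shares, threshold):
    """List providers whose share of critical workloads exceeds `threshold`."""
    return sorted(p for p, share in shares.items() if share > threshold)


if __name__ == "__main__":
    shares = critical_share_by_provider(WORKLOADS)
    # The 0.5 threshold is an arbitrary example, not a recommended limit.
    print(over_threshold(shares, 0.5))
```

Here three of the four critical workloads sit with one provider, so that provider is flagged. Counting workloads is deliberately crude; weighting by revenue or user impact would likely give a more useful picture.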

So, What Can You Do About It? 

1. Be Honest About the Trade-Offs 

Going all-in on one vendor brings speed and innovation, but it also increases risk. Make sure your business leaders understand the pros and cons and that risk tolerance is agreed up front.   

2. Embrace Hybrid and Multi-Cloud 

Hybrid and multi-cloud strategies are your best bet for resilience. Hybrid cloud, in particular, lets you bring in non-public cloud options for critical workloads. Many organisations (including Node4 clients) use hybrid cloud to increase diversity and reduce risk. It’s easier if you stick to portable, standard infrastructure like VMs or containers, but even PaaS services can be run across platforms with tools like Azure Arc.
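
At its simplest, a multi-platform setup comes down to having somewhere else to send work when the preferred platform is unhealthy. The sketch below shows that idea only. The target names and the health check are hypothetical placeholders, and real failover also has to deal with data replication, DNS, and state, none of which is covered here.

```python
from typing import Callable, Optional, Sequence


def first_healthy(
    targets: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first target, in priority order, that passes its health check."""
    for target in targets:
        try:
            if is_healthy(target):
                return target
        except Exception:
            # A health check that raises is treated the same as a failed one.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical priority list: public cloud first, private platform second.
    targets = ["public_cloud_region", "private_cloud_site"]
    currently_down = {"public_cloud_region"}
    print(first_healthy(targets, lambda t: t not in currently_down))
```

If every target fails its check, the function returns `None`, which is the case your recovery plan needs to answer honestly.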

3. Don’t Forget the Edge 

Many organisations run edge services for supported SaaS applications, ensuring that even if the cloud goes down, core business operations can continue.  

4. Plan for Recovery – But Be Realistic 

If your disaster recovery plan relies on the same vendor’s infrastructure, it might not help during a major outage. Test your plans and be transparent about what’s possible and what isn’t.  
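
A quick sanity check is to list every workload whose recovery target is the same vendor as its primary. The sketch below assumes a hypothetical record of your plans; the field names and sample data are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    workload: str
    primary: str   # where the workload normally runs
    recovery: str  # where the DR plan says it will fail over to


# Hypothetical plans, for illustration only.
PLANS = [
    RecoveryPlan("payments", primary="provider_a", recovery="provider_a"),
    RecoveryPlan("customer_portal", primary="provider_a", recovery="private_cloud"),
    RecoveryPlan("reporting", primary="provider_b", recovery="provider_b"),
]


def same_vendor_recovery(plans):
    """Return workloads whose recovery target is the same vendor as the primary."""
    return [plan.workload for plan in plans if plan.primary == plan.recovery]


if __name__ == "__main__":
    print(same_vendor_recovery(PLANS))
```

A workload on this list is not necessarily unprotected (a second region may be enough for many failures), but it is exposed to a vendor-wide incident and should be tested with that scenario in mind.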

Quick Checklist: Building Cloud Resilience 

  • Map your dependencies – know which services are critical and where they run. 
  • Diversify where it matters most (hybrid, multi-cloud, edge). 
  • Regularly review and test your disaster recovery plans. 
  • Communicate risks and trade-offs clearly to business stakeholders. 
  • Stay informed about new tools and approaches for cross-cloud resilience.

Final Thoughts 

Cloud outages are here to stay, but with the right strategy, you can manage concentration risk and keep your business running even when the unexpected happens. The technology, platforms, and expertise exist to help you build a more resilient, hybrid cloud future.

Sign up for our Hybrid Cloud Innovation Workshop

Recent cloud outages have shown just how vulnerable single-provider strategies can be. Join our Hybrid Cloud Innovation Workshop to explore how Microsoft Azure and Azure Arc can help your business stay agile, secure, and always on.

Don’t wait for the next disruption. Future-proof your infrastructure today.