21 Oct 2025

Managing Concentration Risk & Service Layer Outages in the Cloud: Lessons from the AWS Outage

Written by The Node4 Team

When the Cloud Goes Down: Why It Matters

Earlier this week, millions of users and businesses across the globe were reminded just how much we rely on the cloud. A major AWS outage in the US-EAST-1 region didn’t just take down a few services – it disrupted everything from HMRC and airport check-in systems to Slack, Snapchat, and Fortnite. The story made mainstream news in the UK, underscoring just how far-reaching these incidents can be.

But here’s the uncomfortable truth: these outages aren’t rare, and they’re not unique to AWS. Over the past few years, every major cloud provider – Microsoft, Google, AWS, and others – has suffered significant service disruptions. The question isn’t if another outage will happen, but when.

Recent Cloud Failures & Outages

Let’s take a quick look at some of the most impactful outages from the last few years:

2025 

Company	Cause
AWS	DNS resolution failure in US-EAST-1 disrupted DynamoDB, affecting Snapchat, Fortnite, Coinbase, Canva, Zoom, Slack, and more.
Microsoft 365	Service degradation due to backend configuration issues impacted Teams, Outlook, and SharePoint.
Google Cloud	IAM and networking issues caused outages in Gmail, YouTube, and GCP APIs.

2024 

Company	Cause
Google Cloud Frankfurt	A significant power failure caused downtime across European clients.
CrowdStrike	A faulty update disabled millions of Windows machines globally.
Azure Central US	Configuration failure impacted VMs, Cosmos DB, and Microsoft 365.

2023 

Company	Cause
AWS US-EAST-1	Regional outage affected Slack, Zoom, and banking apps.
Azure	DNS and routing issues disrupted Microsoft 365 and Azure DevOps.

2022 

Company	Cause
Cloudflare CDN	DNS and edge server failure disrupted Google, Amazon, Facebook, and more.
Azure	Authentication and access control issues caused global service degradation.

2021 

Company	Cause
AWS	Control plane overload in US-EAST-1 disrupted EC2, Lambda, and EventBridge.
Azure	Global DNS outage affected Microsoft 365, Teams, and DevOps.
Akamai	DNS misconfiguration caused global web outages for FedEx, Steam, and PlayStation Network.

Why Are Cloud Outages So Disruptive?

Public cloud platforms are a mesh of interdependent services that are largely abstracted away from customers, so if one component has an issue, such as DNS resolution for DynamoDB, it can bring down a whole cascade of other services, as the AWS outage showed. In fact, two of the biggest culprits in cloud failures are DNS and centralised management planes (e.g. authentication). Outages of these types can be global and can affect many services at once.

The real kicker? Many cloud providers use their own platforms to run their operational systems. So, a failure in one area can quickly cascade, making recovery slow and complex.
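To make the cascade effect concrete, here is a minimal sketch in Python that walks a dependency map and lists everything affected when one component fails. The service names and the `DEPENDS_ON` map are hypothetical and purely illustrative; they do not describe any provider's real internals.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it depends on directly.
# These names are illustrative assumptions, not a real provider's architecture.
DEPENDS_ON = {
    "dns": [],
    "auth": ["dns"],
    "database": ["dns", "auth"],
    "queue": ["auth"],
    "checkout-app": ["database", "queue"],
    "status-page": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a component to its dependants.
    dependants = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependants.setdefault(dep, []).append(service)

    # Breadth-first walk outwards from the failed component.
    impacted = set()
    to_visit = deque([failed])
    while to_visit:
        current = to_visit.popleft()
        for service in dependants.get(current, []):
            if service not in impacted:
                impacted.add(service)
                to_visit.append(service)
    return impacted


print(sorted(impacted_by("dns", DEPENDS_ON)))
```

In this toy map a DNS failure reaches every service except the independent status page, which is the pattern described above: a low-level shared component sits underneath almost everything else.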

Understanding Concentration Risk

Concentration risk is what happens when an organisation (or even an entire sector) relies too heavily on a single cloud provider or service. In the UK, the public cloud market is dominated by just three players, making it tough to diversify and limit exposure. The outage history above suggests that these incidents are not only recurring but can also be global, impacting your core infrastructure, SaaS providers, and supply chain all at once.
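One rough way to put a number on concentration is to work out what share of your workloads sits with each provider and then sum the squares of those shares. The sketch below does that in Python. The squared-share index is borrowed from the Herfindahl-Hirschman index used in market analysis; it is an assumption brought in for illustration rather than a method from this post, and the workload inventory is hypothetical.

```python
from collections import Counter

# Hypothetical inventory of (workload, provider) pairs; replace with your own.
WORKLOADS = [
    ("payments-api", "aws"),
    ("customer-portal", "aws"),
    ("data-warehouse", "aws"),
    ("email", "microsoft"),
    ("file-archive", "private-cloud"),
]


def provider_shares(workloads):
    """Return the fraction of workloads hosted by each provider."""
    counts = Counter(provider for _, provider in workloads)
    total = sum(counts.values())
    return {provider: count / total for provider, count in counts.items()}


def concentration_index(shares):
    """Sum of squared shares: 1.0 means everything sits with one provider."""
    return sum(share ** 2 for share in shares.values())


shares = provider_shares(WORKLOADS)
print(shares)
print(round(concentration_index(shares), 2))
```

Counting workloads equally is a simplification. Weighting each workload by business criticality or revenue at risk would probably give a more useful picture, and the index is best read as a trend over time rather than an absolute score.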

So, What Can You Do About It?

1. Be Honest About the Trade-Offs

Going all-in on one vendor brings speed and innovation, but it also increases risk. Make sure your business leaders understand the pros and cons and that risk tolerance is agreed up front.

2. Embrace Hybrid and Multi-Cloud

Hybrid and multi-cloud strategies are your best bet for resilience. Hybrid cloud, in particular, lets you bring in non-public cloud options for critical workloads. Many organisations (including Node4 clients) use hybrid cloud to increase diversity and reduce risk. It’s easier if you stick to portable, standard infrastructure like VMs or containers, but even PaaS services can be run across platforms with tools like Azure Arc.

3. Don’t Forget the Edge

Many organisations run edge services for supported SaaS applications, ensuring that even if the cloud goes down, core business operations can continue.

4. Plan for Recovery – But Be Realistic

If your disaster recovery plan relies on the same vendor’s infrastructure, it might not help during a major outage. Test your plans and be transparent about what’s possible and what isn’t.
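A simple first check is to list which workloads have a recovery site on the same provider as their primary site. The Python sketch below shows the idea; the `Site` type, the `DR_PLAN` contents, and the provider and region labels are all hypothetical placeholders for your own inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Where a workload runs: hypothetical provider and region labels."""
    provider: str
    region: str


# Hypothetical plan: workload name -> (primary site, recovery site).
DR_PLAN = {
    "payments-api": (Site("aws", "us-east-1"), Site("aws", "us-west-2")),
    "customer-portal": (Site("aws", "eu-west-2"), Site("private-cloud", "uk-dc-1")),
}


def shared_fate(plan):
    """Return workloads whose recovery site uses the same provider as the primary."""
    return sorted(
        name
        for name, (primary, recovery) in plan.items()
        if primary.provider == recovery.provider
    )


print(shared_fate(DR_PLAN))
```

A workload flagged here is not necessarily unprotected, since a second region often helps with regional failures. It does mean the plan may not hold if the provider's shared services, such as DNS or authentication, are affected, which is worth stating plainly to stakeholders.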

Quick Checklist: Building Cloud Resilience

Map your dependencies – know which services are critical and where they run.

Diversify where it matters most (hybrid, multi-cloud, edge).

Regularly review and test your disaster recovery plans.

Communicate risks and trade-offs clearly to business stakeholders.

Stay informed about new tools and approaches for cross-cloud resilience.

Final Thoughts

Cloud outages are here to stay, but with the right strategy, you can manage concentration risk and keep your business running even when the unexpected happens. The technology, platforms, and expertise exist to help you build a more resilient, hybrid cloud future.