What can we learn from the recent AWS outage, and how can we apply those lessons to our own infrastructure?
On October 20, 2025, AWS experienced a major disruption that rippled across the internet (and social media), affecting widely used services such as Zoom, Microsoft Teams, Slack, and Atlassian. The issue originated not in a single data center or customer workload, but in the AWS control plane, the management layer that coordinates how resources like EC2 instances, DynamoDB tables, and IAM roles operate.
The initial trigger appears to have been a DNS failure around the DynamoDB API endpoint in the US-EAST-1 region, compounded by a malfunction in the subsystem that monitors network load balancers. Because those health-monitoring services also run in US-EAST-1, AWS throttled new EC2 instance launches while restoring the subsystem.
Though AWS markets its regions as isolated clusters of data centers with independent power and cooling, this incident showed that core control-plane functions remain centralized, creating hidden dependencies that can cascade globally.
Analysts quickly identified that US-EAST-1 hosts the shared control plane behind many of AWS’s global services. Workloads running in Europe or Asia still rely on API calls that route back to or through US-EAST-1, so the failure there had global consequences.
When the region’s DNS and health-check subsystems failed, those control-plane calls stalled worldwide. The end result was a global slowdown in EC2 launches, configuration updates, and authentication, despite the other regions being technically “healthy.”
AWS’s own design guidance encourages customers to spread workloads across availability zones for resiliency, but these customer-facing resiliency mechanisms ultimately depend on the same centralized control plane. In other words, data-plane isolation worked as designed, but control-plane isolation did not.
This pattern has surfaced before, not just at AWS. Cloudflare, Microsoft, and Google have all suffered outages triggered by control-plane or configuration failures that propagated globally. The lesson here is that in modern distributed systems, control-plane fragility can become a single point of failure.
AWS may be in the spotlight now, but looking across the industry, nearly every major cloud or CDN provider (AWS, Cloudflare, Microsoft, Google) has experienced control-plane-related outages in the past five years. These are rarely caused by attacks; more often, they stem from routine operational changes, misconfigurations, or centralized service dependencies.
The October 2025 AWS outage simply demonstrates that no cloud provider is immune. The best defense is architectural: distribute risk, decouple dependencies, and design for graceful degradation.
We’re proud that Wallarm’s Security Edge demonstrates how these lessons can be applied proactively. By shifting API protection into a multi-cloud, active-active edge fabric, organizations gain resilience not just against attacks, but against the infrastructure failures that even the largest providers occasionally suffer.
Understanding and analyzing major outages like AWS’s is essential for infrastructure engineers. Each incident reveals gaps between design assumptions and real-world complexity, exposing weak points that might otherwise go unnoticed and ultimately impact service availability. By studying what failed, why it failed, and how recovery proceeded, architects can refine their infrastructure and systems to be more fault-tolerant and resilient. This mindset of continuous learning ensures that when the next disruption happens, the impact on users and business operations is minimized. So what are the key lessons here?
1. Design for true multi-region, active-active operation: A standby region isn’t enough if control traffic can’t reach it. Run in an active-active configuration so that the loss of one resource or region doesn’t disable the service (a client-side failover sketch follows this list).
2. Avoid single-region control planes: It seems obvious to say now, but configuration and metadata services should be either fully local or replicated globally. Any system that depends on a single region’s DNS, load balancers, or other systems for coordination introduces a global risk.
3. Separate the control plane from the data plane: The AWS incident began in the control plane but quickly cascaded to the data plane. Architect systems so runtime traffic can continue independently of configuration or provisioning systems.
4. Distribute DNS and caching layers: Multi-provider DNS with long TTLs can reduce the impact of control-plane-related resolution failures, and cached or regional read replicas keep applications partially functional when the origin is unreachable (see the stale-cache resolver sketch after this list).
5. Implement circuit breakers and bulkhead isolation: Systems should fail fast and gracefully. Circuit breakers can reroute or degrade functionality instead of hammering a failing endpoint, while bulkhead isolation limits the spread of failures between components (a minimal circuit-breaker sketch follows this list).
6. Continuously test failure scenarios: Regular “chaos” testing validates that redundancy, DNS failover, and RTO/RPO objectives work in practice. AWS’s own Resilience Hub encourages these tests, but the lesson applies to any cloud or hybrid deployment. Check out Chaos Monkey, introduced by Netflix in 2011, as an example, or start smaller with a fault-injection wrapper like the one sketched after this list.
7. Plan for multi-cloud or hybrid resiliency: Multi-availability-zone redundancy doesn’t protect against control-plane issues. Deploying across multiple cloud providers or keeping a minimal on-prem footprint prevents total dependence on one provider’s management systems.
8. Decouple capacity from failover logic: AWS mitigated the outage by throttling new instance launches, which bought time, not resilience. Reserve compute in secondary regions ahead of time, but ensure failover logic operates independently of the control plane.
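To make lesson 1 concrete, here is a minimal sketch of client-side failover across two active regional endpoints, written in Python with the requests library. The endpoint URLs are hypothetical, and in most deployments this logic lives in DNS or a global load balancer rather than application code; the point is simply that no single region is required for a call to succeed.

```python
import requests

# Hypothetical active-active endpoints serving the same API from different regions.
REGIONAL_ENDPOINTS = [
    "https://api.eu-west-1.example.com",
    "https://api.us-east-1.example.com",
]

def fetch_with_regional_failover(path: str, timeout: float = 2.0) -> requests.Response:
    """Try each active region in turn and return the first healthy response."""
    last_error: Exception | None = None
    for base in REGIONAL_ENDPOINTS:
        try:
            resp = requests.get(f"{base}{path}", timeout=timeout)
            if resp.status_code < 500:
                return resp  # Region answered; 4xx is an application error, not a regional outage.
        except requests.RequestException as exc:
            last_error = exc  # DNS errors, timeouts, connection resets: try the next region.
    raise RuntimeError(f"All regions failed; last error: {last_error}")
```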
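For lesson 4, the “serve something rather than nothing” idea can be sketched as a stale-capable resolver cache using only the standard library. The TTL value and fallback policy below are illustrative assumptions, not recommendations; managed multi-provider DNS handles this at a different layer, but the failure mode it protects against is the same.

```python
import socket
import time

# hostname -> (resolved addresses, timestamp of the last successful resolution)
_dns_cache: dict[str, tuple[list[str], float]] = {}
FRESH_TTL = 300  # seconds a cached answer is treated as fresh (illustrative value)

def resolve_with_stale_fallback(hostname: str, port: int = 443) -> list[str]:
    """Resolve via the OS resolver; on failure, fall back to the last known good answer."""
    cached = _dns_cache.get(hostname)
    if cached and time.time() - cached[1] < FRESH_TTL:
        return cached[0]
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        addrs = sorted({info[4][0] for info in infos})
        _dns_cache[hostname] = (addrs, time.time())
        return addrs
    except socket.gaierror:
        if cached:
            return cached[0]  # Resolution failed: serve the stale answer instead of failing outright.
        raise
```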
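Lesson 5 is the most directly expressible in code. Below is a bare-bones circuit-breaker sketch in pure Python; production systems would normally use a proven library or a service mesh, but the state machine is the same: fail fast while a dependency is unhealthy, then let a probe request through after a cooldown.

```python
import time

class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, allows a probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast instead of hammering the endpoint")
            # Cooldown elapsed: fall through and allow one probe call (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # Trip (or re-trip) the breaker.
            raise
        self.failures = 0  # Success: close the breaker again.
        self.opened_at = None
        return result
```

A caller would wrap each outbound dependency call, for example breaker.call(fetch_config), and switch to a degraded code path when the breaker raises rather than retrying in a tight loop.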
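And for lesson 6, chaos testing doesn’t have to start with a full Chaos Monkey rollout. A small fault-injection wrapper, run against a staging environment, already shows whether retries, breakers, and fallbacks behave as intended; the injection rate and added latency below are purely illustrative.

```python
import functools
import random
import time

def inject_faults(error_rate: float = 0.1, extra_latency_s: float = 0.5):
    """Decorator that randomly fails or slows calls to exercise failure handling."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < error_rate:
                raise ConnectionError("injected fault: simulated dependency outage")
            if random.random() < error_rate:
                time.sleep(extra_latency_s)  # Simulate a slow, degraded dependency.
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(error_rate=0.2)
def call_downstream_api():
    # In a chaos test this would call the real (staging) dependency.
    return "ok"
```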
In designing the architecture for Security Edge, we set out to address many of these specific challenges. Wallarm’s Security Edge was intentionally built to avoid the single-provider fragility exposed by the AWS outage, and it embodies these principles by design. Security Edge includes:

Join Wallarm’s Field CTO, Tim Ebbers, this Thursday, October 23, for an in-depth conversation on the AWS outage and architecting for resiliency.