Most teams read the status page only when they are already in trouble. The better use is to read it before anything breaks and treat it as a record of the risks the organization is already accepting.
The chapter combines publicly documented incidents with field patterns seen in enterprise operations. Treat the patterns as directional operating guidance, not immutable laws.
The public incident history across AWS, Azure, and GCP is enough to establish three points.
First, provider-level incidents are not rare enough to ignore.
Second, most major events are not simple hardware failures. They are configuration mistakes, dependency cascades, control-plane issues, and coordination failures.
Third, customers who do not model those events explicitly end up inheriting a risk appetite they never actually chose.
What the public record tells us
Across the major providers, the broad pattern is consistent:
- Tier 1 events happen infrequently, but not so infrequently that a multi-year workload can ignore them.
- Tier 2 events happen often enough that a region-dependent business path should expect them.
- Tier 3 events happen frequently enough that resilience patterns should absorb them as normal operating conditions.
This is not a vendor indictment. It is the operating reality of complex cloud platforms.
The repeating patterns
Three patterns show up repeatedly in public incident records:
- configuration changes create outsized failures
- one impaired dependency cascades into many apparently unrelated services
- provider monitoring or status channels degrade during the event itself
That combination matters. Customers need more than platform resilience. They also need the ability to keep operating when information from the provider is partial or delayed.
Default topology is not a strategy
One of the most common mistakes customers make is treating the platform’s default topology as if it were a resilience decision the business made on purpose.
It was not.
Teams inherit a default region, a paired region, a zone-redundant option, or a managed replication setting and quietly assume the provider has already decided the disaster recovery model for them.
That is convenience, not strategy.
High availability and disaster recovery solve different problems. High availability is about near-zero interruption for the active workload. Disaster recovery is about acceptable loss and acceptable delay when the active environment is no longer viable. The geography can overlap. The objectives cannot.
That distinction matters because failover is rarely a magical provider event. It is usually a customer decision with customer trade-offs:
- which region actually has the required service and SKU parity
- whether latency, compliance, and data residency still work after failover
- whether the secondary region has enough quota and capacity to absorb the load
- whether identity, DNS, and control-plane operations still function in the failover path
- whether failback is defined, rehearsed, and operationally affordable
If those answers are unknown, the backup region is a comforting story, not a tested resilience posture.
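One way to make those answers visible is to record, for each question, whether the answer has been tested or is only assumed. The sketch below is illustrative only: the `Evidence` and `FailoverCheck` names and the sample answers are hypothetical, not part of any provider tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    TESTED = "tested"    # verified in a drill or a real failover
    ASSUMED = "assumed"  # believed true, never exercised
    UNKNOWN = "unknown"  # nobody has looked yet


@dataclass(frozen=True)
class FailoverCheck:
    question: str
    evidence: Evidence


def untested(checks):
    """Return the questions that are not backed by a test."""
    return [c.question for c in checks if c.evidence is not Evidence.TESTED]


# Hypothetical answers for one workload; replace with your own findings.
checks = [
    FailoverCheck("service and SKU parity in the secondary region", Evidence.TESTED),
    FailoverCheck("latency, compliance, and data residency after failover", Evidence.ASSUMED),
    FailoverCheck("quota and capacity to absorb the load", Evidence.UNKNOWN),
    FailoverCheck("identity, DNS, and control-plane operations in the failover path", Evidence.ASSUMED),
    FailoverCheck("failback defined, rehearsed, and affordable", Evidence.UNKNOWN),
]

gaps = untested(checks)
print(f"{len(gaps)} of {len(checks)} failover questions are not backed by a test")
for question in gaps:
    print(f"- {question}")
```

Anything the list prints is a place where the failover plan currently rests on belief rather than evidence.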
A regional strategy should explicitly classify three states:
- Required: multi-region is mandatory due to outage cost, legal exposure, or mission impact
- Optional: multi-region is beneficial but not required at current business risk level
- Infeasible for now: multi-region economics or supply constraints do not clear the threshold yet, so risks are accepted and documented
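The three states can be captured as a small decision record so that no posture is stored without its rationale. This is a minimal sketch under assumed names (`RegionalPosture`, `PostureDecision`, and the example workload are hypothetical); a spreadsheet or policy document can enforce the same rule.

```python
from dataclasses import dataclass
from enum import Enum


class RegionalPosture(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"
    INFEASIBLE_FOR_NOW = "infeasible for now"


@dataclass(frozen=True)
class PostureDecision:
    workload: str
    posture: RegionalPosture
    rationale: str
    accepted_risk: str = ""

    def __post_init__(self):
        # Every classification needs a recorded reason.
        if not self.rationale.strip():
            raise ValueError("every posture needs a recorded rationale")
        # Declining multi-region for now means the risk is accepted, so say which risk.
        if self.posture is RegionalPosture.INFEASIBLE_FOR_NOW and not self.accepted_risk.strip():
            raise ValueError("an infeasible posture must document the accepted risk")


# Hypothetical example entry.
decision = PostureDecision(
    workload="checkout-api",
    posture=RegionalPosture.INFEASIBLE_FOR_NOW,
    rationale="secondary region lacks quota for the required SKU",
    accepted_risk="multi-hour regional outage of the checkout path",
)
print(decision.workload, "->", decision.posture.value)
```

The validation is deliberately strict: an "infeasible for now" entry without a documented risk is rejected, which mirrors the requirement that accepted risks are written down.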
What this means for customer risk
The question is not whether a provider incident will affect the workload over its lifetime. The question is what that exposure means to revenue, trust, contracts, and the operating cadence of the business.
If your primary revenue path depends on one region, you are accepting that at least one multi-hour outage of that path is likely over the system's lifetime.
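A rough way to put a number on "likely" is to treat qualifying regional events as arriving independently at a constant average rate (a Poisson assumption, which real incidents only approximate). The rate used below is a hypothetical placeholder, not a measured figure for any provider; substitute a value derived from the incident history of your own region and services.

```python
import math


def prob_at_least_one(events_per_year: float, years: float) -> float:
    """Probability of one or more events over the period, assuming a Poisson process."""
    return 1.0 - math.exp(-events_per_year * years)


# Hypothetical inputs: one qualifying regional event every five years on average,
# and a workload expected to run for five years.
assumed_rate = 0.2
lifetime_years = 5

p = prob_at_least_one(assumed_rate, lifetime_years)
print(f"Chance of at least one such outage: {p:.0%}")  # about 63% under these assumptions
```

Even a rate that sounds small on an annual basis compounds into a better-than-even chance over a multi-year lifetime, which is the point of framing the exposure over the life of the system rather than per year.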
Once that framing becomes normal, the status page stops being a place to visit during panic and becomes part of pre-incident planning.
Use the provider incident response template in the appendix to formalize response actions and to record which capabilities are tested and which are only assumed: content/posts/000028-reliability-operating-artifacts-and-policy-templates.md.
Practical model
| Event class | Typical meaning for the customer |
|---|---|
| Tier 1 provider event | strategic exposure to the primary provider becomes visible |
| Tier 2 provider event | regional or platform-layer interruption likely within a year |
| Tier 3 provider event | normal resilience patterns should absorb most of the pain |
Reference baseline (Azure)
For teams running on Azure, these two documents are useful baselines when translating provider constraints into customer reliability design:
- Availability zones overview: https://learn.microsoft.com/en-us/azure/reliability/availability-zones-overview?tabs=azure-cli
- Service-level agreements: https://learn.microsoft.com/en-us/azure/reliability/concept-service-level-agreements
What to do this quarter
- Review incidents relevant to your primary provider, region, and services from the last 12 months.
- Price a three-hour Tier 2 event during business hours.
- Identify the one provider-side event class your current design handles worst.
- Validate that the supposed failover region actually has the service, SKU parity, quota runway, and operational readiness you are counting on.
- Classify your regional posture as required, optional, or currently infeasible, and record the rationale.
- Record the accepted exposure explicitly.
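Pricing the three-hour event does not need a sophisticated model to be useful. The sketch below is a first-order estimate only; the function name, the cost components, and every figure are hypothetical and should be replaced with numbers from finance and from your contracts.

```python
def outage_cost(hours, revenue_per_hour, fraction_lost,
                contract_penalties=0.0, recovery_labor=0.0):
    """First-order cost of an outage: lost revenue plus direct penalties and labor.

    fraction_lost is the share of hourly revenue that is actually lost
    rather than deferred (0.0 to 1.0).
    """
    if not 0.0 <= fraction_lost <= 1.0:
        raise ValueError("fraction_lost must be between 0 and 1")
    lost_revenue = hours * revenue_per_hour * fraction_lost
    return lost_revenue + contract_penalties + recovery_labor


# Hypothetical inputs for a three-hour event during business hours.
estimate = outage_cost(
    hours=3,
    revenue_per_hour=40_000,
    fraction_lost=0.6,
    contract_penalties=15_000,
    recovery_labor=5_000,
)
print(f"Estimated cost of the event: {estimate:,.0f}")
```

The estimate leaves out harder-to-quantify effects such as customer trust and churn, so treat it as a floor when comparing against the cost of a stronger regional posture.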
Bottom line
Status pages are not only operational notifications. They are free actuarial tables. Teams that ignore them are often choosing risk by omission rather than decision.
Chapter bridge
Chapter 6 shifts from provider incident evidence to partial-failure design, where degraded behavior becomes the normal planning assumption.