In many modern architectures, identity failures can disable systems at the customer-journey level. They usually do not fail gracefully.
Yet most teams model identity as a third-party SLA they inherit rather than as a failure domain they control.
This chapter exists to change that.
Why identity often behaves as Tier-0
Every request in a modern system goes through identity:
- User logs in (OIDC/OAuth provider)
- Session is created (session store)
- Token is issued (key server)
- Token is validated (token service)
- Authorization decision is made (policy engine)
- Request proceeds
If one of these steps fails in a critical path, the request pipeline can fail fast for a large share of users.
The dependency is usually unidirectional in practice: many customer journeys depend on identity first, so identity outages propagate quickly.
The cascade: Login fails → users cannot reach the application → support surge → customers leave → business damage.
This can happen even when core compute and storage remain healthy.
The identity failure modes nobody plans for
Mode 1: Token refresh failure
Your application caches tokens for 5 minutes. The token service has a transient issue, and new tokens cannot be issued.
What happens:
- Existing tokens work (already cached)
- New users cannot log in
- Returning users cannot refresh (token expires, request fails)
- After 5-10 minutes, logged-in users start failing
- Platform-wide authentication failure
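Why teams miss this: The service is still “responding.” It returns well-formed rejections to valid token requests, so monitoring that only watches availability and error codes shows “no errors”: the system appears to be working as designed, rejecting requests it considers invalid. The fault is in that validity judgment, not in availability.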
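One common way to reduce exposure to this mode is to start refreshing well before expiry, so a transient token-service issue can be retried while the old token still works. The sketch below is illustrative only: `refresh_deadline`, `retry_budget_s`, the 60% refresh point, and the 10% jitter are assumptions, not values from any particular provider or SDK.

```python
import random
from typing import Optional


def refresh_deadline(issued_at: float, lifetime_s: float,
                     refresh_fraction: float = 0.6,
                     jitter_fraction: float = 0.1,
                     rng: Optional[random.Random] = None) -> float:
    """Return the timestamp at which a client should start refreshing.

    Refreshing at roughly 60% of the lifetime, plus jitter so clients do
    not all refresh at once, leaves the remaining lifetime for retries.
    """
    rng = rng or random.Random()
    jitter = lifetime_s * jitter_fraction * rng.random()
    return issued_at + lifetime_s * refresh_fraction + jitter


def retry_budget_s(issued_at: float, lifetime_s: float, deadline: float) -> float:
    """Seconds left to retry a failed refresh before the token expires."""
    return issued_at + lifetime_s - deadline


# A 5-minute token issued at t=0 starts refreshing between 180s and 210s,
# which leaves 90 to 120 seconds to ride out a transient issue.
deadline = refresh_deadline(issued_at=0.0, lifetime_s=300.0)
budget = retry_budget_s(0.0, 300.0, deadline)
print(f"refresh at {deadline:.0f}s, retry budget {budget:.0f}s")
```

The trade-off is extra load on the token service in exchange for a retry window; it does not help once an outage outlasts that window.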
Mode 2: Federation drift
Your system federates with external identity providers. One provider is upgraded, behavior changes subtly, your assumptions break.
Real example:
- Provider changed token response format slightly
- Your parser still works (backwards compatible)
- But a claim you depend on is now missing
- Your authorization logic skips that claim
- Users get access they should not have (if your check fails open when the claim is missing)
- Or users lose access they should have (if your check fails closed)
The gap: Integration tests pass. Production breaks. By the time you notice, duplicates may have been created, sensitive data accessed, or systems corrupted.
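A guard against this kind of drift is to check, at the trust boundary, that every claim your authorization logic reads is actually present, and to fail closed when it is not. The following is a minimal sketch: the claim names in `REQUIRED_CLAIMS`, the `MissingClaimError` type, and the `require_claims` helper are hypothetical and should be replaced with whatever your own policy depends on.

```python
# Hypothetical list: the claims your authorization logic actually reads.
REQUIRED_CLAIMS = ("sub", "iss", "roles")


class MissingClaimError(Exception):
    """Raised when a token lacks a claim that authorization depends on."""


def require_claims(claims: dict, required=REQUIRED_CLAIMS) -> dict:
    """Return the claims unchanged, or raise if a required one is absent or empty."""
    missing = [
        name for name in required
        if name not in claims or claims[name] in (None, "", [])
    ]
    if missing:
        # Fail closed and make the drift visible instead of silently skipping the claim.
        raise MissingClaimError(f"Token is missing required claims: {missing}")
    return claims


# Example: a provider upgrade stops sending "roles".
try:
    require_claims({"sub": "user-1", "iss": "https://idp.example"})
except MissingClaimError as exc:
    print(f"rejected: {exc}")
```

Counting these rejections as a metric turns a silent authorization change into an alert you can see within minutes of the provider's change.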
Mode 3: Session store failure
Your session store is a shared Redis or similar. It fails or becomes unreachable.
Options:
- Reject all requests (safe but harsh)
- Fall back to in-memory sessions (works until the node dies)
- Trust the token alone (security risk if token validation is weak)
What actually happens: You probably pick option 2 without thinking. Then a node fails, another node takes traffic, sessions are lost, users are kicked out.
Mode 4: Third-party SLA failure
Your identity provider is Entra ID, Auth0, Okta, Cognito, IAM Identity Center, or similar. They have an SLA.
The SLA is usually 99.9%. That sounds good. It is not:
- You cannot failover to another provider instantly
- Your customers cannot take their sessions elsewhere
- You are building on someone else’s infrastructure
What happens during their incident:
- The provider is within SLA but your system is offline
- You have no failover
- You receive no service credits
- Your customers bear the cost
The gap: You have no fallback. If your identity provider is down for 20 minutes (within their SLA), your system is down for 20 minutes (outside yours).
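The arithmetic behind that claim is short enough to check directly. This sketch assumes the SLA is measured over a 30-day period; real contracts define their own measurement window and exclusions, so treat `allowed_downtime_minutes` as an illustration rather than a reading of any provider's terms.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 30) -> float:
    """Minutes of downtime that still satisfy the SLA over the given period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99):
    print(f"{sla}% over 30 days allows {allowed_downtime_minutes(sla):.1f} minutes down")
```

At 99.9% the budget is about 43 minutes per 30 days, so a 20-minute provider outage can sit comfortably inside the contract.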
Mode 5: Secret rotation failures
Identity requires secrets: API keys, signing keys, client secrets, certificates.
Rotating secrets safely requires:
- New secrets are deployed
- Old secrets still work (an overlap window in which both old and new are accepted)
- Systems detect and use new ones
- Old secrets are retired
If any step fails:
- New deployments cannot authenticate
- Old deployments cannot authenticate to new systems
- Failures cascade as dependent services lose the ability to authenticate
The trap: Most teams have a secret rotation procedure they have never tested under production load.
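The overlap step is the one that most often goes untested. One way to picture it: during rotation the verifier accepts signatures from both the new and the previous key, and the old key is dropped only once nothing signs with it any more. The sketch below uses HMAC from the standard library as a stand-in for whatever signing scheme you use; the `sign` and `verify_during_rotation` helpers and the key values are hypothetical.

```python
import hashlib
import hmac
from typing import Sequence


def sign(message: bytes, key: bytes) -> str:
    """Sign a message with HMAC-SHA256 and return the hex digest."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_during_rotation(message: bytes, signature: str,
                           accepted_keys: Sequence[bytes]) -> bool:
    """Accept a signature made with any currently accepted key (newest first)."""
    return any(
        hmac.compare_digest(sign(message, key), signature)
        for key in accepted_keys
    )


old_key, new_key = b"old-secret", b"new-secret"
message = b"session=abc"
old_signature = sign(message, old_key)

# Overlap window: both keys are accepted, so old deployments keep working.
print(verify_during_rotation(message, old_signature, [new_key, old_key]))
# After retirement: only the new key is accepted.
print(verify_during_rotation(message, old_signature, [new_key]))
```

Rehearsing the move from the first call to the second under real traffic is what tells you whether every signer has actually switched.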
Mode 6: Cross-region identity consistency
Your system spans regions. Identity decisions must be consistent across regions. That requires replication, and cross-region replication is usually eventually consistent.
The problem: During replication lag:
- User logs in in Region A
- Request goes to Region B
- Region B has not seen the login yet
- Request is rejected
This is not a failure. It is a timing issue. It is also a user experience nightmare.
The multiplier: If your policy data (who can access what) is region-replicated, policy changes can take 30 seconds to propagate. During that time, users either have access they should not or lack access they should.
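For the login case, one mitigation is to treat a local miss as "not replicated yet" rather than "not logged in", and to check the region that handled the login before rejecting. The sketch below models each region's session store as a plain dict; `find_session` and the idea that the request can identify its home region are assumptions for illustration.

```python
from typing import Optional, Tuple


def find_session(session_id: str, local_store: dict,
                 home_store: dict) -> Tuple[Optional[dict], str]:
    """Look up a session locally, then in the region where the login happened."""
    session = local_store.get(session_id)
    if session is not None:
        return session, "local"

    # Local miss: replication may simply be behind, so ask the home region.
    session = home_store.get(session_id)
    if session is not None:
        local_store[session_id] = session  # warm the local copy for later requests
        return session, "home-region"

    return None, "not-found"


region_a = {"s1": {"user": "alice"}}  # login happened here
region_b = {}                         # replication has not caught up yet
print(find_session("s1", local_store=region_b, home_store=region_a))
```

This trades one cross-region call on a miss for fewer spurious rejections; it does not help with policy changes, where stale data grants or denies access until replication completes.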
What you should be doing (and probably are not)
1. Identity is architecture, not infrastructure
Treat it like you treat your database:
- Design for failure explicitly
- Have a fallback strategy
- Test recovery procedures
- Monitor independently of the provider
2. Implement a fallback identity source
If your primary identity provider fails:
- Do you have a read-only copy you can fall back to?
- Do you have cached session data you can use?
- Can you issue temporary tokens based on a previous session?
Most teams answer “no” to all three. That is a structural vulnerability.
3. Understand your identity provider’s blast radius
When your identity provider fails:
- Does it fail gracefully? (returns a clear error)
- Does it time out? (your system times out waiting)
- Does it become partially unavailable? (some regions work, others do not)
- What is the failure detection time? (how long before you know?)
Test this. Actually test it. Not by reading documentation. By breaking it in staging and seeing what happens.
4. Model identity in your disaster recovery plan
Most DR plans cover:
- Data replication
- Database failover
- Geographic failover
They do not cover:
- What happens if identity is unavailable
- How you validate that recovery worked
- Whether you can issue temporary credentials
- How you coordinate with the identity team during an incident
Add this to your DR runbook explicitly.
5. Separate authentication from authorization
- Authentication: “Are you who you say you are?”
- Authorization: “Are you allowed to do this?”
If your authorization depends on real-time policy evaluation from your identity provider, you have added a dependency. Cache policies. Refresh periodically. Fall back to cached policy during failures.
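The "cache, refresh, fall back" advice can be sketched as a small wrapper around whatever call fetches your policy. Everything here is an assumption made for illustration: the `CachedPolicy` class, the `fetch` callable, and the 60-second refresh and 1-hour staleness limits are not taken from any specific policy engine.

```python
import time
from typing import Callable, Optional


class CachedPolicy:
    """Serve policy from a local cache, tolerating fetch failures for a bounded time."""

    def __init__(self, fetch: Callable[[], dict], refresh_after_s: float = 60,
                 max_stale_s: float = 3600,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch
        self.refresh_after_s = refresh_after_s
        self.max_stale_s = max_stale_s
        self.clock = clock
        self._policy: Optional[dict] = None
        self._fetched_at = 0.0

    def get(self) -> dict:
        now = self.clock()
        if self._policy is None or now - self._fetched_at >= self.refresh_after_s:
            try:
                self._policy = self.fetch()
                self._fetched_at = now
            except Exception:
                # No cached copy, or the copy is too old to trust: surface the failure.
                if self._policy is None or now - self._fetched_at > self.max_stale_s:
                    raise
                # Otherwise keep serving the stale copy during the outage.
        return self._policy
```

The `max_stale_s` bound is the design decision that matters: it caps how long a revoked permission can remain effective while the policy source is unreachable.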
6. Measure identity SLI separately from application SLI
Your application SLI: “Successful transactions / attempted transactions”
Your identity SLI: “Successful authentications / attempted authentications”
These should be separate metrics. If identity is at 99% and the application is at 99%, and every transaction needs both, your combined system is at 98.01% (0.99 × 0.99, assuming the failures are independent).
Track them independently so you see when identity is the problem.
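The combined figure is a product of the availabilities of everything on the critical path. A minimal sketch, assuming the components fail independently and every transaction needs all of them (`serial_availability` is a name chosen here for illustration):

```python
import math


def serial_availability(*availabilities: float) -> float:
    """Availability of a request path that needs every listed component."""
    return math.prod(availabilities)


# Identity at 99% and application at 99%.
print(f"{serial_availability(0.99, 0.99):.4%}")
# Adding a third 99% dependency lowers the ceiling again.
print(f"{serial_availability(0.99, 0.99, 0.99):.4%}")
```

Because the product only shrinks as dependencies are added, a separate identity SLI is what lets you tell which factor moved.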
7. Test token expiration under load
Token refresh is easy to test in the lab. It is harder to test when:
- Your service is under load
- You are in the middle of a deployment
- Your identity provider is slow
- Your cache is warm but expiring
Regularly run gamedays where you degrade identity performance and watch how the system behaves.
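A gameday needs some way to make identity slow or flaky on demand. One lightweight option, outside production, is to wrap the function that calls the identity provider so it adds latency and fails at a chosen rate. This is a sketch under assumptions: `with_injected_faults` is a hypothetical helper, and the wrapped callable stands in for your own validation or refresh call.

```python
import random
import time
from typing import Callable, Optional


def with_injected_faults(call: Callable, delay_s: float = 0.0,
                         failure_rate: float = 0.0,
                         rng: Optional[random.Random] = None,
                         sleep: Callable[[float], None] = time.sleep) -> Callable:
    """Wrap an identity call so it is delayed and fails at the given rate."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        sleep(delay_s)  # simulate a slow provider
        if rng.random() < failure_rate:
            raise TimeoutError("injected identity fault")
        return call(*args, **kwargs)

    return wrapped
```

For example, wrapping a token validator with `delay_s=3.0` and `failure_rate=0.2` in staging shows whether callers time out, retry sensibly, or pile up.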
8. Implement monitoring for identity failures
Do not wait for users to discover the problem.
Detection Queries
Token Refresh Failure Rate (Alert if > 0.1% in 5 min)
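```python
# Sketch of a monitor; alert() is a placeholder for your alerting hook
import time
from collections import deque

class TokenRefreshMonitor:
    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, success, latency_ms)

    def track_refresh(self, success: bool, latency_ms: int):
        # Track: success rate, latency, errors over a 5-minute window
        now = time.time()
        self.events.append((now, success, latency_ms))
        while now - self.events[0][0] > self.window_seconds:
            self.events.popleft()
        # Alert if failure rate spikes
        success_rate = sum(ok for _, ok, _ in self.events) / len(self.events)
        if success_rate < 0.999:  # < 99.9% success
            alert(f"Token refresh at {success_rate*100:.2f}%, below threshold")
        # Bonus: Track refresh latency percentiles
        # If p99 latency > 5 seconds, client timeouts increase
        latencies = sorted(latency for _, _, latency in self.events)
        latency_p99 = latencies[int(0.99 * (len(latencies) - 1))]
        if latency_p99 > 5000:
            alert(f"Token refresh p99 latency at {latency_p99}ms")
```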
Session Store Replication Lag (Alert if > 1 second)
```python
import time

# If sessions are replicated, verify lag is low.
# connect_to_master(), connect_to_replica() and alert() are placeholders
# for your own Redis clients and alerting hook.
redis_master = connect_to_master()
redis_replica = connect_to_replica()

# Write marker to master; expire it so check keys do not accumulate
marker_key = f"replication_check:{time.time()}"
redis_master.set(marker_key, "1", ex=60)

# Check when it appears on replica
start = time.time()
timeout = 5  # 5 second timeout
while time.time() - start < timeout:
    if redis_replica.get(marker_key):
        lag_seconds = time.time() - start
        if lag_seconds > 1.0:
            alert(f"Session replication lag: {lag_seconds:.2f}s")
        break
    time.sleep(0.01)
else:
    # Loop ended without a break: the marker never reached the replica
    alert("Session replication lag: > 5 seconds (timeout)")
```
Certificate Expiration (Alert 30 days before expiry)
```python
from datetime import datetime, timezone

from cryptography import x509

def check_cert_expiration(cert_path):
    with open(cert_path, 'rb') as f:
        cert = x509.load_pem_x509_certificate(f.read())
    # not_valid_after_utc needs cryptography >= 42; older versions expose
    # the naive-UTC not_valid_after attribute instead
    expiry = cert.not_valid_after_utc
    days_remaining = (expiry - datetime.now(timezone.utc)).days
    # Check the tighter threshold first so the critical alert is reachable
    if days_remaining < 7:
        alert(f"CRITICAL: Certificate expires in {days_remaining} days")
    elif days_remaining < 30:
        alert(f"Certificate expires in {days_remaining} days")
```
Identity Provider Health (Ping every minute)
```python
from datetime import datetime

import requests

# IDENTITY_PROVIDER_URL, alert() and log_warning() are placeholders
health_status = {}

def monitor_identity_provider_health():
    # Periodically verify the identity provider is responding
    try:
        response = requests.get(
            f"{IDENTITY_PROVIDER_URL}/.well-known/openid-configuration",
            timeout=2
        )
        if response.status_code == 200:
            latency = response.elapsed.total_seconds() * 1000
            if latency > 500:
                log_warning(f"IdP latency high: {latency:.0f}ms")
            # Update health status
            health_status['identity_provider_ok'] = True
            health_status['last_check'] = datetime.now()
        else:
            alert(f"IdP returned {response.status_code}")
            health_status['identity_provider_ok'] = False
    except requests.Timeout:
        alert("IdP timeout (2s)")
        health_status['identity_provider_ok'] = False
    except Exception as e:
        alert(f"IdP check failed: {e}")
        health_status['identity_provider_ok'] = False
```
9. Implement a 1-hour fallback: Degraded-Mode Authentication
When the identity provider fails, do not reject all requests. Instead, operate in degraded mode using cached tokens and permissions.
Implementation Pattern
```python
import json
from datetime import datetime

import requests

# Placeholders assumed to exist elsewhere in your codebase:
# IDENTITY_PROVIDER_URL, redis_client, extract_bearer_token(), Unauthorized,
# metrics, log_info(), log_warning().


class IdentityProviderException(Exception):
    """The provider could not answer (timeout, connection error, unexpected status)."""


class DegradedModeAuthenticator:
    """Fall back to cached tokens when identity provider is unavailable."""

    def __init__(self, cache):
        self.cache = cache  # Redis or similar
        self.provider_down_key = "identity_provider_down"
        self.retry_after = 10  # seconds to skip the provider after a failure
        self.fallback_ttl = 3600  # 1 hour fallback

    def is_provider_healthy(self):
        """Treat the provider as healthy unless a recent failure was recorded."""
        # The marker expires after retry_after seconds, so the provider is
        # retried automatically and normal operation resumes once it recovers.
        return not self.cache.get(self.provider_down_key)

    def authenticate_request(self, request):
        """
        Try to authenticate against identity provider.
        Fall back to cached token if provider is down.
        """
        # Step 1: Extract the token from the request
        token = extract_bearer_token(request)
        if not token:
            return None

        # Step 2: Try to validate against identity provider (primary path)
        if self.is_provider_healthy():
            try:
                user = self.validate_token_with_provider(token)
                # Cache the result for use during an outage.
                # The raw token is the cache key here; consider hashing it.
                self.cache.set(f"cached_token:{token}", json.dumps({
                    'user': user,
                    'validated_at': datetime.now().isoformat()
                }), ex=self.fallback_ttl)
                return user
            except ValueError:
                # Token is invalid (not a provider issue): never fall back
                self.cache.delete(f"cached_token:{token}")
                return None
            except IdentityProviderException as e:
                # Provider is down (timeout, connection error, 5xx): switch to fallback
                log_warning(f"Identity provider unavailable: {e}")
                self.cache.set(self.provider_down_key, "1", ex=self.retry_after)

        # Step 3: Fall back to cached token validation
        cached_result = self.cache.get(f"cached_token:{token}")
        if cached_result:
            cached = json.loads(cached_result)
            user = cached['user']
            validated_at = datetime.fromisoformat(cached['validated_at'])
            age_seconds = (datetime.now() - validated_at).total_seconds()
            # Accept cached tokens up to 1 hour old during outage
            if age_seconds < self.fallback_ttl:
                log_info(f"Using cached token for user {user['id']} (age: {age_seconds:.0f}s)")
                # Mark this session as degraded
                user['degraded_mode'] = True
                return user
            log_warning(f"Cached token expired ({age_seconds:.0f}s old)")

        # Step 4: No valid token available
        return None

    def validate_token_with_provider(self, token):
        """
        Validate token against the actual identity provider.
        Raises ValueError if the token is invalid and
        IdentityProviderException if the provider is unavailable.
        """
        try:
            response = requests.post(
                f"{IDENTITY_PROVIDER_URL}/validate",
                json={'token': token},
                timeout=2
            )
        except requests.Timeout:
            raise IdentityProviderException("Provider timeout")
        except requests.ConnectionError as e:
            raise IdentityProviderException(f"Provider unavailable: {e}")

        if response.status_code == 200:
            return response.json()['user']
        if response.status_code == 401:
            raise ValueError("Token invalid")
        raise IdentityProviderException(
            f"Provider returned {response.status_code}"
        )


# Attach to request handling (create the authenticator once, not per request)
authenticator = DegradedModeAuthenticator(cache=redis_client)

def authenticate_request(request):
    """Middleware that wraps requests with degraded-mode authentication."""
    user = authenticator.authenticate_request(request)
    if not user:
        raise Unauthorized("Authentication failed")
    request.user = user
    # Log degraded mode usage for alerting
    if user.get('degraded_mode'):
        metrics.increment('auth.degraded_mode_usage')
```
Recovery Validation Checklist
When the identity provider comes back online:
```python
from itertools import islice

import requests
import schedule

def validate_identity_recovery():
    """Verify that identity provider recovery is complete."""
    checks = {
        'provider_responding': False,
        'new_tokens_issued': False,
        'tokens_valid': False,
    }
    # Degraded-mode sessions are replaced automatically on the next token
    # refresh, so they are not gated here; watch the
    # auth.degraded_mode_usage metric fall to zero instead.

    # Check 1: Provider is responding to health checks
    try:
        health = requests.get(
            f"{IDENTITY_PROVIDER_URL}/.well-known/openid-configuration",
            timeout=2
        )
        checks['provider_responding'] = health.status_code == 200
    except Exception:
        return checks  # Still down

    # Check 2: New tokens can be issued (dedicated test account)
    try:
        token = request_new_token("test@internal", "test_password")
        checks['new_tokens_issued'] = token is not None
    except Exception as e:
        log_warning(f"Cannot issue new tokens: {e}")

    # Check 3: Existing tokens validate correctly (sample up to 10 cached tokens;
    # scan_iter's count argument is only a batch-size hint, not a limit)
    authenticator = DegradedModeAuthenticator(cache=redis_client)
    valid_count = 0
    for token_key in islice(redis_client.scan_iter("cached_token:*"), 10):
        if isinstance(token_key, bytes):
            token_key = token_key.decode()
        token = token_key.split(':', 1)[1]
        try:
            authenticator.validate_token_with_provider(token)
            valid_count += 1
        except Exception:
            pass
    checks['tokens_valid'] = valid_count > 0

    # Only declare recovery if ALL checks pass
    if all(checks.values()):
        log_info("Identity recovery validated. Resuming normal operation.")
    else:
        log_warning(f"Identity recovery incomplete: {checks}")
    return checks

# Call this periodically while provider is recovering.
# The job only runs if something calls schedule.run_pending() in a loop.
schedule.every(10).seconds.do(validate_identity_recovery)
```
10. Test identity failover before you need it
Do not discover your identity fallback does not work during an outage.
Gameday Scenario: Identity Provider Down
```python
import time

def gameday_identity_provider_down():
    """Test: What happens when identity provider is offline?"""
    print("=== GAMEDAY: Identity Provider Down ===")

    # Step 1: Simulate provider unavailability
    mock_identity_provider.set_status("unavailable")

    # Step 2: Try normal user workflows.
    # With a cache-only fallback, new logins and token refreshes cannot
    # succeed while the provider is down, so record them for the report
    # without requiring them to pass.
    observed = {
        'login_new_user': test_login("new@example.com"),
        'refresh_token': test_token_refresh(),
    }
    results = {
        'access_protected_resource': test_access("user_data"),
        'api_calls_succeed': test_api_calls(100),
    }

    # Step 3: Check degraded mode kicked in
    metrics = get_current_metrics()
    results['degraded_mode_active'] = metrics['auth.degraded_mode_usage'] > 0

    # Step 4: Simulate provider recovery
    mock_identity_provider.set_status("healthy")

    # Step 5: Verify recovery works (re-read metrics; the earlier snapshot is stale)
    time.sleep(5)
    metrics = get_current_metrics()
    results['provider_recovery_detected'] = metrics['identity_provider_health'] == 'ok'
    results['new_tokens_issued'] = test_new_token_after_recovery() is not None

    # Report
    print(f"Observed during outage: {observed}")
    print(f"Results: {results}")
    assert all(results.values()), "Identity failover failed"
```
Run this gameday:
- Monthly (minimum)
- Before major releases
- When you change identity architecture
The uncomfortable truth
Your identity provider may have a better SLA than your system does. That does not guarantee protection. The provider can be within contract while your customer journey is still degraded or unavailable.
Until you have a fallback, a local cache, or a secondary provider, you are betting your uptime on someone else’s infrastructure with no redundancy.
That is the vulnerability that identity failures exploit.
Key architecture principle
Identity should not have an unmitigated single point of failure you do not control.
If your only identity source is a third-party API:
- You have accepted SLA-bound uptime
- You have accepted their failure modes
- You have accepted their recovery time
- You have no option to make it faster
That is a structural choice. Make it intentionally, not by accident.
I work at Microsoft. The views expressed here are my own and based solely on publicly available information. This content is for educational purposes and does not represent official Microsoft guidance or commitments.