The Reliability Survival Guide — Book

Chapter 5: Identity – The System Kill Switch

Identity failures disable everything downstream. Yet most teams treat identity as infrastructure and third-party SLAs as sufficient. This chapter shows why identity must be a primary failure domain with explicit resilience architecture.


In many modern architectures, identity failures can disable systems at the customer-journey level. They usually do not fail gracefully.

Yet most teams model identity as a third-party SLA they inherit rather than as a failure domain they control.

This chapter exists to change that.

Why identity often behaves as Tier-0

Every request in a modern system goes through identity:

  • User logs in (OIDC/OAuth provider)
  • Session is created (session store)
  • Token is issued (key server)
  • Token is validated (token service)
  • Authorization decision is made (policy engine)
  • Request proceeds

If one of these steps fails in a critical path, the request pipeline can fail fast for a large share of users.

The dependency is usually unidirectional in practice: many customer journeys depend on identity first, so identity outages propagate quickly.

The cascade: Login fails → users cannot reach the application → support surge → customers leave → business damage.

This can happen even when core compute and storage remain healthy.

The identity failure modes nobody plans for

Mode 1: Token refresh failure

Your application caches tokens for 5 minutes. Token service has a transient issue. New tokens cannot be issued.

What happens:

  • Existing tokens work (already cached)
  • New users cannot log in
  • Returning users cannot refresh (token expires, request fails)
  • After 5-10 minutes, logged-in users start failing
  • Platform-wide authentication failure

Why teams miss this: The service is “responding.” It is just refusing valid token requests. Monitoring shows “no errors” because the system is working as designed (rejecting invalid requests). It is the validity that is wrong.
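One way to catch a provider that responds but refuses valid requests is a synthetic probe that uses known-good credentials and treats a clean rejection as a failure. The sketch below is a minimal illustration under that assumption; `probe_token_issuance`, `ProbeResult`, and the injected `request_token` callable are hypothetical names, not part of any provider SDK.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    healthy: bool
    reason: str


def probe_token_issuance(request_token: Callable[[], Optional[str]]) -> ProbeResult:
    """Run a synthetic token request with known-good credentials.

    request_token is a hypothetical callable that returns a token string,
    or None/"" when the provider answers but declines to issue a token.
    """
    try:
        token = request_token()
    except Exception as exc:
        # Network errors, timeouts, and similar transport-level problems
        return ProbeResult(False, f"transport error: {exc}")

    if not token:
        # The provider "responded", but a request that must succeed was refused.
        # Dashboards that only count 5xx responses would miss this case.
        return ProbeResult(False, "provider responded but refused a known-good request")

    return ProbeResult(True, "token issued")
```

Feeding the probe result into the same alerting path as error-rate metrics makes "responding but refusing" visible as an outage rather than as normal rejection traffic.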

Mode 2: Federation drift

Your system federates with external identity providers. One provider is upgraded, behavior changes subtly, your assumptions break.

Real example:

  • Provider changed token response format slightly
  • Your parser still works (backwards compatible)
  • But a claim you depend on is now missing
  • Your authorization logic silently skips the check that used that claim
  • Users get access they should not have (if the skipped check defaults to allow)
  • Or users lose access they should have (if it defaults to deny)

The gap: Integration tests pass. Production breaks. By the time you notice, duplicates may have been created, sensitive data accessed, or systems corrupted.
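A guard against this kind of drift is to validate the claim set explicitly and fail closed when a claim you depend on is absent. The following is a minimal sketch; the claim names in `REQUIRED_CLAIMS` and the `MissingClaimError` type are illustrative assumptions, and your provider's claim names will differ.

```python
# Hypothetical claim names; replace with the claims your authorization logic reads.
REQUIRED_CLAIMS = ("sub", "iss", "roles")


class MissingClaimError(Exception):
    """Raised when a token lacks a claim that authorization depends on."""


def require_claims(claims: dict, required=REQUIRED_CLAIMS) -> dict:
    """Return claims unchanged, or raise if any required claim is missing or empty."""
    missing = [
        name for name in required
        if name not in claims or claims[name] in (None, "", [])
    ]
    if missing:
        # Fail closed: a missing claim is an error, never a skipped check
        raise MissingClaimError(f"missing required claims: {', '.join(missing)}")
    return claims
```

Counting `MissingClaimError` occurrences per provider gives an early signal that a federation partner changed its token format, before access decisions drift.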

Mode 3: Session store failure

Your session store is a shared Redis or similar. It fails or becomes unreachable.

Options:

  1. Reject all requests (safe but harsh)
  2. Fall back to in-memory sessions (works until the node dies)
  3. Trust the token alone (security risk if token validation is weak)

What actually happens: You probably pick option 2 without thinking. Then a node fails, another node takes traffic, sessions are lost, users are kicked out.

Mode 4: Third-party SLA failure

Your identity provider is Entra ID, Auth0, Okta, Cognito, IAM Identity Center, or similar. They have an SLA.

The SLA is usually 99.9%. That sounds good. It is not:

  • You cannot failover to another provider instantly
  • Your customers cannot take their sessions elsewhere
  • You are building on someone else’s infrastructure

What happens during their incident:

  • The provider is within SLA but your system is offline
  • You have no failover
  • You receive no service credits
  • Your customers bear the cost

The gap: You have no fallback. If your identity provider is down for 20 minutes (within their SLA), your system is down for 20 minutes (outside yours).
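The arithmetic behind that gap is simple enough to check. The sketch below converts an availability percentage into a downtime allowance, assuming a 30-day measurement period; real SLAs define their own measurement windows and exclusions, so treat the period as an assumption.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime that still satisfy the SLA over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# 99.9% over 30 days leaves roughly 43 minutes of permitted downtime,
# so a 20-minute provider outage fits inside the contract.
print(f"{allowed_downtime_minutes(99.9):.1f} minutes")
```

If your own SLO is tighter than the provider's SLA over the same period, a single in-contract provider incident can consume your entire error budget.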

Mode 5: Secret rotation failures

Identity requires secrets: API keys, signing keys, client secrets, certificates.

Rotating secrets safely requires:

  • New secrets are deployed
  • Old secrets still work (overlap period in which both are accepted)
  • Systems detect and use new ones
  • Old secrets are retired

If any step fails:

  • New deployments cannot authenticate
  • Old deployments cannot authenticate to new systems
  • Authentication failures cascade into every dependent service

The trap: Most teams have a secret rotation procedure they have never actually tested under production traffic.
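The overlap step in the rotation sequence above can be sketched as a verifier that accepts either the current or the previous key. This is a minimal illustration using HMAC signatures from the standard library; the function names are hypothetical, and real signing keys would come from a secret store rather than literals.

```python
import hashlib
import hmac
from typing import Optional


def sign(key: bytes, message: bytes) -> str:
    """HMAC-SHA256 signature of message under key, as a hex string."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_during_rotation(
    message: bytes,
    signature: str,
    current_key: bytes,
    previous_key: Optional[bytes] = None,
) -> bool:
    """Accept signatures made with the current key or, during the overlap
    window, with the previous key. Pass previous_key=None to retire it."""
    candidates = [current_key]
    if previous_key is not None:
        candidates.append(previous_key)
    # compare_digest avoids leaking match position through timing
    return any(hmac.compare_digest(sign(key, message), signature) for key in candidates)
```

Retiring the old key then becomes a configuration change (dropping `previous_key`) that can be rehearsed in staging before it is applied in production.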

Mode 6: Cross-region identity consistency

Your system spans regions. Identity decisions must be consistent across regions. This requires replication and eventual consistency.

The problem: During replication lag:

  • User logs in in Region A
  • Request goes to Region B
  • Region B has not seen the login yet
  • Request is rejected

This is not a failure. It is a timing issue. It is also a user experience nightmare.

The multiplier: If your policy data (who can access what) is region-replicated, policy changes can take 30 seconds to propagate. During that time, users either have access they should not or lack access they should.

What you should be doing (and probably are not)

1. Identity is architecture, not infrastructure

Treat it like you treat your database:

  • Design for failure explicitly
  • Have a fallback strategy
  • Test recovery procedures
  • Monitor independent of the provider

2. Implement a fallback identity source

If your primary identity provider fails:

  • Do you have a read-only copy you can fall back to?
  • Do you have cached session data you can use?
  • Can you issue temporary tokens based on a previous session?

Most teams answer “no” to all three. That is a structural vulnerability.

3. Understand your identity provider’s blast radius

When your identity provider fails:

  • Does it fail gracefully? (returns a clear error)
  • Does it timeout? (your system times out waiting)
  • Does it become partially unavailable? (some regions work, others do not)
  • What is the failure detection time? (how long before you know?)

Test this. Actually test it. Not by reading documentation. By breaking it in staging and seeing what happens.

4. Model identity in your disaster recovery plan

Most DR plans cover:

  • Data replication
  • Database failover
  • Geographic failover

They do not cover:

  • What happens if identity is unavailable
  • How you validate that recovery worked
  • Whether you can issue temporary credentials
  • How you coordinate with the identity team during an incident

Add this to your DR runbook explicitly.

5. Separate authentication from authorization

  • Authentication: “Are you who you say you are?”
  • Authorization: “Are you allowed to do this?”

If your authorization depends on real-time policy evaluation from your identity provider, you have added a dependency. Cache policies. Refresh periodically. Fall back to cached policy during failures.
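The cache, refresh, and fallback behavior described here can be sketched as a small wrapper around whatever call fetches policy. This is a minimal illustration, not a production cache; `PolicyCache`, its parameters, and the injected `fetch` callable are hypothetical names, and the default intervals are placeholders.

```python
import time
from typing import Callable, Optional


class PolicyCache:
    """Serve cached policy, refresh periodically, fall back to stale data on failure."""

    def __init__(
        self,
        fetch: Callable[[], dict],
        refresh_seconds: float = 60.0,
        max_stale_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._refresh_seconds = refresh_seconds
        self._max_stale_seconds = max_stale_seconds
        self._clock = clock
        self._policy: Optional[dict] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> dict:
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._refresh_seconds:
            return self._policy  # still fresh, no provider call

        try:
            self._policy = self._fetch()
            self._fetched_at = now
            return self._policy
        except Exception:
            # Provider unavailable: serve the stale copy, but only within the bound
            if self._fetched_at is not None and now - self._fetched_at < self._max_stale_seconds:
                return self._policy
            raise
```

The `max_stale_seconds` bound is the security decision: it caps how long a revoked permission can remain effective while the policy source is unreachable.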

6. Measure identity SLI separately from application SLI

Your application SLI: “Successful transactions / attempted transactions”

Your identity SLI: “Successful authentications / attempted authentications”

These should be separate metrics. If identity is at 99% and application is at 99%, and every transaction depends on both, your combined system is at 98.01% (0.99 × 0.99).

Track them independently so you see when identity is the problem.
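The combined figure comes from multiplying the availabilities of components that sit in series. The sketch below shows the calculation; it assumes failures are independent and that every request needs every component, which is a simplification.

```python
import math


def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every component to succeed.

    Assumes independent failures; correlated failures change the result.
    """
    return math.prod(components)


print(f"{serial_availability(0.99, 0.99):.4%}")  # prints 98.0100%
```

Adding a third serial dependency at 99% lowers the product again, which is why each dependency needs its own SLI rather than a single blended number.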

7. Test token expiration under load

Token refresh is easy to test in the lab. It is harder to test when:

  • Your service is under load
  • You are in the middle of a deployment
  • Your identity provider is slow
  • Your cache is warm but expiring

Regularly run gamedays where you degrade identity performance and watch how the system behaves.

The uncomfortable truth

Your identity provider may have a better SLA than your system does. That does not guarantee protection. The provider can be within contract while your customer journey is still degraded or unavailable.

Until you have a fallback, a local cache, or a secondary provider, you are betting your uptime on someone else’s infrastructure with no redundancy.

That is the vulnerability that identity failures exploit.


8. Implement monitoring for identity failures

Do not wait for users to discover the problem.

Detection Queries

Token Refresh Failure Rate (Alert if > 0.1% in 5 min)

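# Sketch of a rolling-window monitor; alert() is a placeholder for your paging hook
import time
from collections import deque

class TokenRefreshMonitor:
    WINDOW_SECONDS = 300        # 5-minute window
    MIN_SUCCESS_RATE = 0.999    # alert above 0.1% failures
    MAX_P99_LATENCY_MS = 5000   # beyond this, client timeouts increase

    def __init__(self):
        self.events = deque()   # (timestamp, success, latency_ms)

    def track_refresh(self, success: bool, latency_ms: int):
        # Track: success rate, latency, errors
        now = time.time()
        self.events.append((now, success, latency_ms))
        # Drop events that have aged out of the window
        while self.events[0][0] < now - self.WINDOW_SECONDS:
            self.events.popleft()

        # Alert if failure rate spikes (the window always holds at least this event)
        success_rate = sum(1 for _, ok, _ in self.events if ok) / len(self.events)
        if success_rate < self.MIN_SUCCESS_RATE:  # < 99.9% success
            alert(f"Token refresh at {success_rate*100:.2f}%, below threshold")

        # Bonus: track refresh latency percentiles
        latencies = sorted(ms for _, _, ms in self.events)
        p99 = latencies[int(0.99 * (len(latencies) - 1))]
        if p99 > self.MAX_P99_LATENCY_MS:
            alert(f"Token refresh p99 latency at {p99}ms")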

Session Store Replication Lag (Alert if > 1 second)

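# If sessions are replicated, verify lag is low
# connect_to_master / connect_to_replica are placeholders for your Redis clients
import time

redis_master = connect_to_master()
redis_replica = connect_to_replica()

# Write marker to master; expire it so probe keys do not accumulate
marker_key = f"replication_check:{time.time()}"
redis_master.set(marker_key, "1", ex=60)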

# Check when it appears on replica
start = time.time()
timeout = 5  # 5 second timeout

while time.time() - start < timeout:
    if redis_replica.get(marker_key):
        lag_seconds = time.time() - start
        if lag_seconds > 1.0:
            alert(f"Session replication lag: {lag_seconds:.2f}s")
        break
    time.sleep(0.01)
else:
    alert(f"Session replication lag: > 5 seconds (timeout)")

Certificate Expiration (Alert 30 days before expiry)

from datetime import datetime, timezone

from cryptography import x509

def check_cert_expiration(cert_path):
    with open(cert_path, 'rb') as f:
        cert = x509.load_pem_x509_certificate(f.read())
    
    # not_valid_after_utc is timezone-aware (cryptography >= 42);
    # on older versions use not_valid_after with a naive UTC "now"
    expiry = cert.not_valid_after_utc
    days_remaining = (expiry - datetime.now(timezone.utc)).days
    
    # Check the tighter threshold first so the critical alert is reachable
    if days_remaining < 7:
        alert(f"CRITICAL: Certificate expires in {days_remaining} days")
    elif days_remaining < 30:
        alert(f"Certificate expires in {days_remaining} days")

Identity Provider Health (Ping every minute)

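import requests
from datetime import datetime

health_status = {}  # shared health state, read by the authentication path

def monitor_identity_provider_health():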
    # Periodically verify the identity provider is responding
    try:
        response = requests.get(
            f"{IDENTITY_PROVIDER_URL}/.well-known/openid-configuration",
            timeout=2
        )
        
        if response.status_code == 200:
            latency = response.elapsed.total_seconds() * 1000
            
            if latency > 500:
                log_warning(f"IdP latency high: {latency}ms")
            
            # Update health status
            health_status['identity_provider_ok'] = True
            health_status['last_check'] = datetime.now()
        else:
            alert(f"IdP returned {response.status_code}")
            health_status['identity_provider_ok'] = False
    
    except requests.Timeout:
        alert(f"IdP timeout (2s)")
        health_status['identity_provider_ok'] = False
    except Exception as e:
        alert(f"IdP check failed: {e}")
        health_status['identity_provider_ok'] = False

9. Implement a time-boxed fallback: Degraded-Mode Authentication

When the identity provider fails, do not reject all requests. Instead, operate in degraded mode using cached tokens and permissions for a bounded window (1 hour in the example below).

Implementation Pattern

import json
from datetime import datetime

import requests


class IdentityProviderException(Exception):
    """Raised when the provider cannot answer (timeout, 5xx, connection error)."""


class DegradedModeAuthenticator:
    """Fall back to cached tokens when identity provider is unavailable."""
    
    def __init__(self, cache):
        self.cache = cache  # Redis or similar
        self.last_ok_key = "identity_provider_last_ok"
        self.down_key = "identity_provider_down"
        self.fallback_ttl = 3600  # 1 hour fallback
        self.retry_after = 30  # seconds to skip the provider after a failure
    
    def is_provider_healthy(self):
        """Presume the provider is healthy unless a recent call failed."""
        # The down marker expires after retry_after seconds, so the provider is
        # retried automatically and a cold start is never stuck in fallback.
        return not self.cache.get(self.down_key)
    
    def authenticate_request(self, request):
        """
        Try to authenticate against identity provider.
        Fall back to cached token if provider is down.
        """
        
        # Step 1: Extract the token from the request
        token = extract_bearer_token(request)
        if not token:
            return None
        
        # In production, hash the token rather than storing it raw in the key
        cache_key = f"cached_token:{token}"
        
        # Step 2: Try to validate against identity provider (primary path)
        if self.is_provider_healthy():
            try:
                user = self.validate_token_with_provider(token)
            except ValueError:
                # Token is invalid (not a provider issue): reject, drop any cached copy
                self.cache.delete(cache_key)
                return None
            except IdentityProviderException as e:
                # Timeout, 5xx, or connection failure: switch to fallback
                log_warning(f"Identity provider unavailable: {e}")
                self.cache.set(self.down_key, "1", ex=self.retry_after)
            else:
                # Mark provider as healthy and cache the result (JSON, not a raw dict)
                now = datetime.now().isoformat()
                self.cache.set(self.last_ok_key, now)
                self.cache.delete(self.down_key)
                self.cache.set(cache_key, json.dumps({
                    'user': user,
                    'validated_at': now
                }), ex=self.fallback_ttl)
                return user
        
        # Step 3: Fall back to cached token validation
        cached_raw = self.cache.get(cache_key)
        if cached_raw:
            cached_result = json.loads(cached_raw)
            user = cached_result['user']
            validated_at = datetime.fromisoformat(cached_result['validated_at'])
            age_seconds = (datetime.now() - validated_at).total_seconds()
            
            # Accept cached tokens up to fallback_ttl (1 hour) old during outage
            if age_seconds < self.fallback_ttl:
                log_info(f"Using cached token for user {user['id']} (age: {age_seconds}s)")
                
                # Mark this session as degraded
                user['degraded_mode'] = True
                return user
            else:
                log_warning(f"Cached token expired ({age_seconds}s old)")
        
        # Step 4: No valid token available
        return None
    
    def validate_token_with_provider(self, token):
        """
        Validate token against the actual identity provider.
        Raises IdentityProviderException if provider is unavailable.
        """
        try:
            response = requests.post(
                f"{IDENTITY_PROVIDER_URL}/validate",
                json={'token': token},
                timeout=2
            )
            
            if response.status_code == 200:
                return response.json()['user']
            elif response.status_code == 401:
                raise ValueError("Token invalid")
            else:
                raise IdentityProviderException(
                    f"Provider returned {response.status_code}"
                )
        
        except requests.Timeout:
            raise IdentityProviderException("Provider timeout")
        except requests.ConnectionError as e:
            raise IdentityProviderException(f"Provider unavailable: {e}")


# Attach to request handling
def authenticate_request(request):
    """Middleware that wraps requests with degraded-mode authentication."""
    authenticator = DegradedModeAuthenticator(cache=redis_client)
    user = authenticator.authenticate_request(request)
    
    if not user:
        raise Unauthorized("Authentication failed")
    
    request.user = user
    
    # Log degraded mode usage for alerting
    if user.get('degraded_mode'):
        metrics.increment('auth.degraded_mode_usage')

Recovery Validation Checklist

When the identity provider comes back online:

from itertools import islice

def validate_identity_recovery(authenticator):
    """Verify that identity provider recovery is complete."""
    
    # List only checks this function actually performs: all() below can never
    # pass if a key is declared but never set. Replacement of degraded-mode
    # sessions and orphaned-session cleanup are workload-specific; add them
    # here once you have a way to measure them.
    checks = {
        'provider_responding': False,
        'new_tokens_issued': False,
        'tokens_valid': False,
    }
    
    # Check 1: Provider is responding to health checks
    try:
        health = requests.get(
            f"{IDENTITY_PROVIDER_URL}/.well-known/openid-configuration",
            timeout=2
        )
        checks['provider_responding'] = health.status_code == 200
    except Exception:
        return checks  # Still down
    
    # Check 2: New tokens can be issued
    try:
        token = request_new_token("test@internal", "test_password")
        checks['new_tokens_issued'] = token is not None
    except Exception as e:
        log_warning(f"Cannot issue new tokens: {e}")
    
    # Check 3: Existing tokens validate correctly
    # Sample up to 10 keys (scan_iter's count argument is only a batch-size hint)
    sampled = 0
    valid_count = 0
    for token_key in islice(redis_client.scan_iter("cached_token:*"), 10):
        if isinstance(token_key, bytes):
            token_key = token_key.decode()
        token = token_key.split(':', 1)[1]
        sampled += 1
        try:
            authenticator.validate_token_with_provider(token)
            valid_count += 1
        except Exception:
            pass
    
    # An empty cache is not a failure; otherwise at least one sample must validate
    checks['tokens_valid'] = sampled == 0 or valid_count > 0
    
    # Degraded-mode session flags clear automatically on each user's next token refresh
    
    # Only declare recovery if ALL checks pass
    if all(checks.values()):
        log_info("Identity recovery validated. Resuming normal operation.")
    else:
        log_warning(f"Identity recovery incomplete: {checks}")
    return checks

# Call this periodically while provider is recovering.
# `authenticator` is a DegradedModeAuthenticator instance; the schedule library
# only runs jobs when something polls schedule.run_pending() in a loop.
schedule.every(10).seconds.do(validate_identity_recovery, authenticator)

10. Test identity failover before you need it

Do not discover your identity fallback does not work during an outage.

Gameday Scenario: Identity Provider Down

import time

def gameday_identity_provider_down():
    """Test: What happens when identity provider is offline?"""
    
    print("=== GAMEDAY: Identity Provider Down ===")
    
    # Step 1: Simulate provider unavailability
    mock_identity_provider.set_status("unavailable")
    
    # Step 2: Try normal user workflows
    # New logins need the provider, so the expected outcome is a clean rejection;
    # the remaining checks use sessions validated before the outage.
    results = {
        'new_login_held': not test_login("new@example.com"),
        'refresh_token': test_token_refresh(),
        'access_protected_resource': test_access("user_data"),
        'api_calls_succeed': test_api_calls(100),
    }
    
    # Step 3: Check degraded mode kicked in
    metrics = get_current_metrics()
    results['degraded_mode_active'] = metrics['auth.degraded_mode_usage'] > 0
    
    # Step 4: Simulate provider recovery
    mock_identity_provider.set_status("healthy")
    
    # Step 5: Verify recovery works (re-read metrics; the Step 3 snapshot is stale)
    time.sleep(5)
    metrics = get_current_metrics()
    results['provider_recovery_detected'] = metrics['identity_provider_health'] == 'ok'
    results['new_tokens_issued'] = test_new_token_after_recovery() is not None
    
    # Report
    print(f"Results: {results}")
    assert all(results.values()), "Identity failover failed"

Run this gameday:

  • Monthly (minimum)
  • Before major releases
  • When you change identity architecture


Identity architecture: resilience patterns

The failure modes above name the problems. This section names the architectural patterns that address them.

Token cache layering

Do not rely on a single token validation path. Layer caches to reduce identity provider dependency.

Client-side cache. The calling service or browser holds a token until near-expiry. Reduces identity provider calls significantly but creates a window where revoked tokens remain valid.

Service-side cache. Server-side middleware caches validated token results for a short window (30 to 300 seconds depending on security requirements). Protects against identity provider latency spikes and short outages.

Edge cache. For read-heavy workloads, a CDN or API gateway can validate tokens against a cached public key set, allowing auth decisions without reaching the origin identity provider at all.

The key design decision is the revocation window: how long can a revoked or expired token remain valid across each cache layer? Define this explicitly and match it to the workload’s security and reliability requirements.
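For the service-side layer, the revocation window is simply the cache TTL. The sketch below shows one way to make that explicit; `ValidatedTokenCache` and its method names are hypothetical, the store is an in-process dict for illustration, and the 60-second default is a placeholder rather than a recommendation.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class ValidatedTokenCache:
    """In-process cache of validation results; ttl_seconds is the revocation window."""

    def __init__(self, ttl_seconds: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    @staticmethod
    def _key(token: str) -> str:
        # Store a digest so raw bearer tokens never sit in the cache
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def put(self, token: str, claims: dict) -> None:
        self._entries[self._key(token)] = (self._clock(), claims)

    def get(self, token: str) -> Optional[dict]:
        key = self._key(token)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, claims = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: caller must revalidate with the provider
            return None
        return claims

    def revoke(self, token: str) -> None:
        """Shorten the window to zero for a token you know has been revoked."""
        self._entries.pop(self._key(token), None)
```

A miss from `get` means "revalidate against the provider", so a revoked token survives in this layer for at most `ttl_seconds` unless `revoke` is called first.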


Managed identity versus application identity tradeoffs

Both options exist for service-to-service authentication in cloud environments.

Managed identity (Azure Managed Identity, AWS IAM Roles for EC2 and ECS, GCP Service Accounts):

  • No secret rotation required
  • Identity lifecycle managed by the platform
  • Scoped to the resource, not the application
  • Failure mode: if the control plane is degraded, token acquisition fails regardless of application health

Application identity (client credentials, service principals, API keys):

  • Explicit rotation required and an engineering responsibility
  • More portable across environments and providers
  • Failure mode: secret rotation errors cause authentication failures; no platform fallback

The practical tradeoff. Managed identity removes human-operated rotation overhead and reduces secret sprawl. The cost is a new dependency on the control plane’s token issuance path. If your SLO requires independence from the control plane during outages, managed identity introduces a structural exposure that requires explicit planning.


Multiple authentication paths

Design authentication to degrade gracefully rather than fail completely.

Primary path: Live validation against the identity provider.

Secondary path: Validation against a local token cache or read-only replica of the identity store.

Degraded path: Accept tokens validated within the last N minutes (configurable) when the identity provider is unreachable, with degraded-mode flagging logged and visible to operators.

The degraded path does not mean unlimited access. It means the system remains partially functional for existing sessions while new logins are held until the identity provider recovers. Define the degraded window explicitly. Common values are 30 minutes to 2 hours depending on the workload’s security posture.


Regional token validation independence

If your system spans multiple regions, token validation must work independently per region during a cross-region event.

The problem. Token validation that routes through a single region creates a single point of failure. If that region is impaired, token validation fails everywhere, even if compute and storage in other regions are healthy.

The pattern:

  • Deploy identity provider replicas or caching proxies in each region
  • Validate tokens locally against regional caches of the public key set (for JWT-based tokens, this is the JWKS endpoint)
  • Refresh key sets on a schedule rather than on every request
  • Ensure that a regional identity replica failure does not block authentication in other regions

Azure-specific. Microsoft Entra ID (formerly Azure Active Directory) is globally distributed. Token validation using the published JWKS endpoint works regionally without routing through a single control plane. The failure mode to plan for is the JWKS refresh path during high-load events. Cache the key set aggressively and refresh on a schedule rather than on each validation request.
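The scheduled-refresh advice can be sketched as a key-set cache that keeps the last known-good keys when a refresh fails and rate-limits refetches triggered by unknown key IDs. This is a minimal sketch, not a complete JWT validator; `JwksCache`, its parameters, and the injected `fetch_jwks` callable are hypothetical, and the document shape assumed is the standard JWKS form `{"keys": [{"kid": ...}, ...]}`.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Regional cache of a JWKS document with scheduled refresh."""

    def __init__(
        self,
        fetch_jwks: Callable[[], dict],
        refresh_seconds: float = 3600.0,
        min_retry_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch_jwks
        self._refresh_seconds = refresh_seconds
        self._min_retry_seconds = min_retry_seconds
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._last_success: Optional[float] = None
        self._last_attempt: Optional[float] = None

    def _refresh(self, now: float) -> None:
        self._last_attempt = now
        try:
            jwks = self._fetch()
        except Exception:
            return  # keep the last known-good key set
        keys = {k["kid"]: k for k in jwks.get("keys", []) if "kid" in k}
        if keys:  # never replace good keys with an empty set
            self._keys = keys
            self._last_success = now

    def get_key(self, kid: str) -> Optional[dict]:
        now = self._clock()
        stale = self._last_success is None or now - self._last_success >= self._refresh_seconds
        may_attempt = self._last_attempt is None or now - self._last_attempt >= self._min_retry_seconds
        # Refresh on schedule, or when an unknown kid suggests the provider rotated keys,
        # but never more often than min_retry_seconds.
        if (stale or kid not in self._keys) and may_attempt:
            self._refresh(now)
        return self._keys.get(kid)
```

The retry floor matters during incidents: without it, a burst of tokens carrying an unknown `kid` turns every validation into a call to an already struggling endpoint.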


Separating control plane auth from application auth

This distinction is critical and frequently overlooked.

Control plane authentication: The identity required to manage, deploy, configure, and scale infrastructure. This is usually platform-managed (Azure Resource Manager, AWS Management Console, GCP IAM).

Application authentication: The identity required for users and services to call your application.

These should have independent failure modes. A control plane authentication impairment (you cannot deploy or scale) should not cascade into application authentication failure (users cannot log in). An application identity provider outage should not require control plane intervention to remediate.

In practice, coupling these paths is common. Service principals used for both deployment pipelines and application authentication create exactly this coupling. The fix is explicit separation: different credentials, different token paths, different failure domains.


Key architecture principle

Identity should not have an unmitigated single point of failure you do not control.

If your only identity source is a third-party API:

  • You have accepted SLA-bound uptime
  • You have accepted their failure modes
  • You have accepted their recovery time
  • You have no option to make it faster

That is a structural choice. Make it intentionally, not by accident.


Chapter bridge

This chapter affects three system properties simultaneously:

  • SLO definition. Your composite SLO must account for identity provider availability as a separate failure domain. An identity SLI should be tracked independently from the application SLI.
  • Failure domain modeling. Identity is a Tier-0 failure domain. It belongs in the dependency graph that Chapter 4 uses to build the financial model. Price an identity outage there.
  • Governance. The token cache strategy, degraded-mode policy, and secret rotation procedure should be architecture decisions with ADRs. Chapter 9 provides the ledger structure.

This chapter connects forward to:

  • Chapter 6a (Partial Failure): Identity failures are often partial, not complete. The degraded-mode authentication pattern in this chapter is the identity-specific application of the partial-failure design model in Chapter 6a.
  • Chapter 7b (Change): Secret rotation and federation upgrades are the most common sources of identity outages caused by change. Chapter 7b’s change management model applies here directly.
  • Chapter 9 (Governance): The identity ADR should be a first-class entry in the reliability ledger, updated whenever the token strategy or provider configuration changes.

Chapter index

Chapter | Topic
Chapter 1 | Opening thesis: reliability as economic decision
Chapter 2 | Incentives and organizational failure
Chapter 3 | The things that actually break
Shared Responsibility | Shared responsibility and accountability vacuum
Chapter 4 | The financial model
Chapter 5 | Provider failures and status page reality
Chapter 6 | Partial failures and degraded-state design
Chapter 5 (Alt) | Identity as a Tier-0 failure domain
Chapter 7 | Hidden cost of observability tooling
Chapter 8 | Trade-offs: on-call, FinOps, and human cost
Chapter 9 | Governance system
Chapter 10 | Execution and the next quarter
Chapter 12 | Reliability pricing and the SaaS margin trap
Appendix | Operating artifacts and policy templates
Chapter 13 | Maturity and organizational adoption

I work at Microsoft. The views expressed here are my own and based solely on publicly available information. This content is for educational purposes and does not represent official Microsoft guidance or commitments.