Chapter 5: Identity – The System Kill Switch

In many modern architectures, an identity failure can disable a system at the level of the whole customer journey, and it rarely degrades gracefully.

Yet most teams model identity as a third-party SLA they inherit rather than as a failure domain they control.

This chapter exists to change that.

Why identity often behaves as Tier-0

Every request in a modern system goes through identity:

User logs in (OIDC/OAuth provider)
Session is created (session store)
Token is issued (key server)
Token is validated (token service)
Authorization decision is made (policy engine)
Request proceeds

If one of these steps fails in a critical path, the request pipeline can fail fast for a large share of users.

The dependency is usually unidirectional in practice: many customer journeys depend on identity first, so identity outages propagate quickly.

The cascade: Login fails → users cannot reach the application → support surge → customers leave → business damage.

This can happen even when core compute and storage remain healthy.

The identity failure modes nobody plans for

Mode 1: Token refresh failure

Your application caches tokens for 5 minutes. Token service has a transient issue. New tokens cannot be issued.

What happens:

Existing tokens work (already cached)
New users cannot log in
Returning users cannot refresh (token expires, request fails)
After 5-10 minutes, logged-in users start failing
Platform-wide authentication failure

Why teams miss this: The service is “responding.” It is just refusing valid token requests. Monitoring shows “no errors” because each rejection looks like the system working as designed (rejecting an invalid request). What is wrong is the service's judgment of validity, and nothing is measuring that.

Mode 2: Federation drift

Your system federates with external identity providers. One provider is upgraded, behavior changes subtly, your assumptions break.

Real example:

Provider changed token response format slightly
Your parser still works (backwards compatible)
But a claim you depend on is now missing
Your authorization logic skips that claim
Users get access they should not have (if authorization is role-based)
Or users lose access they should have (if it is attribute-based)

The gap: Integration tests pass. Production breaks. By the time you notice, duplicates may have been created, sensitive data accessed, or systems corrupted.
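
One way to catch this kind of drift early is to check explicitly for the claims your authorization logic depends on and fail closed when one disappears. The following is a minimal sketch, assuming the token has already been verified and decoded into a dict. The claim names and the `MissingClaimError` type are illustrative assumptions, not part of any provider's API.

```python
REQUIRED_CLAIMS = ("sub", "roles", "tenant_id")  # hypothetical claim names


class MissingClaimError(Exception):
    """Raised when a decoded token lacks a claim the authorization logic needs."""


def require_claims(claims: dict, required=REQUIRED_CLAIMS) -> dict:
    """Return the claims unchanged, or raise if a required claim is absent or empty."""
    missing = [name for name in required if claims.get(name) in (None, "", [])]
    if missing:
        # Fail closed: authorization must not silently skip a missing claim.
        raise MissingClaimError(f"token missing required claims: {missing}")
    return claims
```

Raising here turns a silent authorization change into a visible error that can be counted and alerted on.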

Mode 3: Session store failure

Your session store is a shared Redis or similar. It fails or becomes unreachable.

Options:

Reject all requests (safe but harsh)
Fall back to in-memory sessions (works until the node dies)
Trust the token alone (security risk if token validation is weak)

What actually happens: You probably pick option 2 without thinking. Then a node fails, another node takes traffic, sessions are lost, users are kicked out.

Mode 4: Third-party SLA failure

Your identity provider is Entra ID, Auth0, Okta, Cognito, IAM Identity Center, or similar. They have an SLA.

The SLA is usually 99.9%. That sounds good. It is not:

You cannot failover to another provider instantly
Your customers cannot take their sessions elsewhere
You are building on someone else’s infrastructure

What happens during their incident:

The provider is within SLA but your system is offline
You have no failover
You receive no service credits
Your customers bear the cost

The gap: You have no fallback. If your identity provider is down for 20 minutes (within their SLA), your system is down for 20 minutes (outside yours).
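
To make the arithmetic concrete, here is a small sketch that converts an SLA percentage into a downtime budget. It assumes a 30-day measurement window; real SLA contracts define their own windows and exclusions, so treat the numbers as illustrative.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime a provider can incur in the window and still meet its SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


budget = allowed_downtime_minutes(99.9)
print(f"99.9% over 30 days allows {budget:.1f} minutes of downtime")
print(f"A 20-minute outage is within SLA: {20 <= budget}")
```

Under these assumptions a 99.9% SLA leaves room for roughly 43 minutes of downtime a month, so a 20-minute outage uses less than half of the provider's budget.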

Mode 5: Secret rotation failures

Identity requires secrets: API keys, signing keys, client secrets, certificates.

Rotating secrets safely requires:

New secrets are deployed
Old secrets still work (overlap window: both old and new are accepted)
Systems detect and use new ones
Old secrets are retired

If any step fails:

New deployments cannot authenticate
Old deployments cannot authenticate to new systems
Every service that depends on those credentials starts failing in turn

The trap: Most teams have a secret rotation procedure they have never actually tested under production load.
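
The overlap step is the one most often skipped. A minimal sketch of what "old secrets still work" can look like for an HMAC-signed payload, using only the standard library. The key values and function names are placeholders for illustration.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_during_rotation(message: bytes, signature: str, keys: list[bytes]) -> bool:
    """Accept a signature made with any currently trusted key."""
    return any(hmac.compare_digest(sign(message, key), signature) for key in keys)


old_key, new_key = b"old-secret", b"new-secret"  # placeholder values
trusted = [new_key, old_key]                     # overlap window: both accepted

assert verify_during_rotation(b"payload", sign(b"payload", old_key), trusted)
assert verify_during_rotation(b"payload", sign(b"payload", new_key), trusted)

trusted = [new_key]                              # old key retired
assert not verify_during_rotation(b"payload", sign(b"payload", old_key), trusted)
```

Retiring the old key is a separate, deliberate step. It should happen only after you have confirmed that no caller still signs with it.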

Mode 6: Cross-region identity consistency

Your system spans regions. Identity decisions must be consistent across regions. This requires replication and eventual consistency.

The problem: During replication lag:

User logs in in Region A
Request goes to Region B
Region B has not seen the login yet
Request is rejected

This is not a failure. It is a timing issue. It is also a user experience nightmare.

The multiplier: If your policy data (who can access what) is replicated across regions, policy changes can take 30 seconds to propagate. During that window, users either keep access they should no longer have or lack access they should already have.
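
One possible mitigation for the login case is to consult the region that issued the session before rejecting a request. The sketch below models each region's store as an in-memory dict. It assumes the request carries some indication of which region created the session, which your system may or may not have. The class and function names are hypothetical.

```python
class RegionStore:
    """In-memory stand-in for one region's replicated session store."""

    def __init__(self, name: str):
        self.name = name
        self.sessions: dict[str, dict] = {}


def lookup_session(session_id: str, local: RegionStore, origin: RegionStore):
    """Check the local replica first, then the region that created the session."""
    session = local.sessions.get(session_id)
    if session is None and origin is not local:
        # Possibly replication lag rather than a missing session:
        # ask the issuing region before rejecting the request.
        session = origin.sessions.get(session_id)
    return session


region_a, region_b = RegionStore("A"), RegionStore("B")
region_a.sessions["s1"] = {"user": "alice"}  # login lands in Region A

# Region B has not replicated "s1" yet, but the fallback still finds it.
assert lookup_session("s1", local=region_b, origin=region_a) == {"user": "alice"}
assert lookup_session("missing", local=region_b, origin=region_a) is None
```

The trade-off is an extra cross-region call on every local miss, which adds latency and a dependency of its own, so it is worth bounding with a short timeout.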

What you should be doing (and probably are not)

1. Identity is architecture, not infrastructure

Treat it like you treat your database:

Design for failure explicitly
Have a fallback strategy
Test recovery procedures
Monitor independently of the provider

2. Implement a fallback identity source

If your primary identity provider fails:

Do you have a read-only copy you can fall back to?
Do you have cached session data you can use?
Can you issue temporary tokens based on a previous session?

Most teams answer “no” to all three. That is a structural vulnerability.

3. Understand your identity provider’s blast radius

When your identity provider fails:

Does it fail gracefully? (returns a clear error)
Does it time out? (your system times out waiting)
Does it become partially unavailable? (some regions work, others do not)
What is the failure detection time? (how long before you know?)

Test this. Actually test it. Not by reading documentation. By breaking it in staging and seeing what happens.
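
When you run that staging test, it helps to record both how each call failed and how long it took you to find out. A minimal sketch follows, assuming the call under test is wrapped in a function that raises the built-in `TimeoutError` on timeout. An HTTP client's own timeout exception would need to be mapped to it, and `classify_failure` is a hypothetical helper name.

```python
import time


def classify_failure(call):
    """Run one identity-provider call; return (outcome, seconds until we knew)."""
    start = time.monotonic()
    try:
        call()
        outcome = "ok"
    except TimeoutError:
        outcome = "timeout"      # we waited and got nothing back
    except Exception:
        outcome = "clear_error"  # the provider or network failed fast
    return outcome, time.monotonic() - start
```

Running this against each endpoint and region during the staging experiment gives you a table of failure mode and detection time, which is the information the questions above ask for.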

4. Model identity in your disaster recovery plan

Most DR plans cover:

Data replication
Database failover
Geographic failover

They do not cover:

What happens if identity is unavailable
How you validate that recovery worked
Whether you can issue temporary credentials
How you coordinate with the identity team during an incident

Add this to your DR runbook explicitly.

5. Separate authentication from authorization

Authentication: “Are you who you say you are?”
Authorization: “Are you allowed to do this?”

If your authorization depends on real-time policy evaluation from your identity provider, you have added a dependency. Cache policies. Refresh periodically. Fall back to cached policy during failures.
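
A minimal sketch of the cache-and-fall-back pattern follows. It assumes a `fetch` callable that returns the current policy from wherever it lives. The class name, the refresh interval, and the shape of the policy are illustrative assumptions.

```python
import time


class CachedPolicy:
    """Serve authorization policy from a local cache, refreshing periodically."""

    def __init__(self, fetch, refresh_seconds=60, clock=time.monotonic):
        self._fetch = fetch  # callable returning the current policy
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._policy = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = (
            self._fetched_at is None
            or now - self._fetched_at >= self._refresh_seconds
        )
        if stale:
            try:
                self._policy = self._fetch()
                self._fetched_at = now
            except Exception:
                # Policy source unavailable: keep serving the last known policy.
                if self._policy is None:
                    raise  # nothing cached yet, so fail closed
        return self._policy
```

In a real system you would probably also cap how long a stale policy may be served, since a revoked permission stays effective for as long as the fallback lasts.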

6. Measure identity SLI separately from application SLI

Your application SLI: “Successful transactions / attempted transactions”

Your identity SLI: “Successful authentications / attempted authentications”

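These should be separate metrics. If identity is at 99% and application is at 99%, and every transaction depends on both, your combined system is at roughly 98.01% (0.99 × 0.99, assuming the failures are independent).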

Track them independently so you see when identity is the problem.
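
The combined figure is the product of the availabilities of the components in series. A small sketch follows, assuming independent failures and that every transaction needs every component, which is a simplification.

```python
from math import prod


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must succeed."""
    return prod(components)


combined = serial_availability(0.99, 0.99)
print(f"identity 99% x application 99% = {combined:.2%}")
```

Adding a third dependency at 99% drops the chain to about 97%, which is why each link needs its own SLI.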

7. Test token expiration under load

Token refresh is easy to test in the lab. It is harder to test when:

Your service is under load
You are in the middle of a deployment
Your identity provider is slow
Your cache is warm but expiring

Regularly run gamedays where you degrade identity performance and watch how the system behaves.

The uncomfortable truth

Your identity provider may have a better SLA than your system does. That does not guarantee protection. The provider can be within contract while your customer journey is still degraded or unavailable.

Until you have a fallback, a local cache, or a secondary provider, you are betting your uptime on someone else’s infrastructure with no redundancy.

That is the vulnerability that identity failures exploit.

Key architecture principle

Identity should not have an unmitigated single point of failure you do not control.

If your only identity source is a third-party API:

You have accepted SLA-bound uptime
You have accepted their failure modes
You have accepted their recovery time
You have no option to make it faster

8. Implement monitoring for identity failures

Do not wait for users to discover the problem.

Detection Queries

Token Refresh Failure Rate (Alert if > 0.1% in 5 min)

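# Sketch of a sliding-window monitor; alert() stands in for your paging hook
import time
from collections import deque

class TokenRefreshMonitor:
    def __init__(self, window_seconds=300, min_success_rate=0.999):
        self.window_seconds = window_seconds
        self.min_success_rate = min_success_rate
        self.events = deque()  # (timestamp, success, latency_ms)

    def track_refresh(self, success: bool, latency_ms: int):
        # Track: success rate, latency, errors over the last 5 minutes
        now = time.time()
        self.events.append((now, success, latency_ms))
        while self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

        # Alert if failure rate spikes
        success_rate = sum(ok for _, ok, _ in self.events) / len(self.events)
        if success_rate < self.min_success_rate:  # < 99.9% success
            alert(f"Token refresh at {success_rate*100:.2f}%, below threshold")

        # Bonus: Track refresh latency percentiles
        # If p99 latency > 5 seconds, client timeouts increase
        latencies = sorted(latency for _, _, latency in self.events)
        latency_p99 = latencies[int(0.99 * (len(latencies) - 1))]
        if latency_p99 > 5000:
            alert(f"Token refresh p99 latency at {latency_p99}ms")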

Session Store Replication Lag (Alert if > 1 second)

# If sessions are replicated, verify lag is low
# connect_to_master(), connect_to_replica(), and alert() are placeholders for your own helpers
import time

redis_master = connect_to_master()
redis_replica = connect_to_replica()

# Write marker to master (it expires on its own so markers do not accumulate)
marker_key = f"replication_check:{time.time()}"
redis_master.set(marker_key, "1", ex=60)

# Check when it appears on replica
start = time.monotonic()
timeout = 5  # 5 second timeout

while time.monotonic() - start < timeout:
    if redis_replica.get(marker_key):
        lag_seconds = time.monotonic() - start
        if lag_seconds > 1.0:
            alert(f"Session replication lag: {lag_seconds:.2f}s")
        break
    time.sleep(0.01)
else:
    # Loop ended without break: the marker never reached the replica
    alert("Session replication lag: > 5 seconds (timeout)")

Certificate Expiration (Alert 30 days before expiry)

from datetime import datetime, timezone

from cryptography import x509

def check_cert_expiration(cert_path):
    with open(cert_path, 'rb') as f:
        cert = x509.load_pem_x509_certificate(f.read())

    # not_valid_after_utc requires cryptography >= 42; older versions
    # expose the naive-UTC not_valid_after instead
    expiry = cert.not_valid_after_utc
    days_remaining = (expiry - datetime.now(timezone.utc)).days

    # Check the tighter threshold first, otherwise it can never fire
    if days_remaining < 7:
        alert(f"CRITICAL: Certificate expires in {days_remaining} days")
    elif days_remaining < 30:
        alert(f"Certificate expires in {days_remaining} days")

Identity Provider Health (Ping every minute)

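import requests
from datetime import datetime

IDENTITY_PROVIDER_URL = "https://idp.example.com"  # placeholder, use your issuer URL
health_status = {}  # shared health state read by the rest of the service

def monitor_identity_provider_health():
    # Periodically verify the identity provider is responding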
    try:
        response = requests.get(
            f"{IDENTITY_PROVIDER_URL}/.well-known/openid-configuration",
            timeout=2
        )
        
        if response.status_code == 200:
            latency = response.elapsed.total_seconds() * 1000
            
            if latency > 500:
                log_warning(f"IdP latency high: {latency}ms")
            
            # Update health status
            health_status['identity_provider_ok'] = True
            health_status['last_check'] = datetime.now()
        else:
            alert(f"IdP returned {response.status_code}")
            health_status['identity_provider_ok'] = False
    
    except requests.Timeout:
        alert(f"IdP timeout (2s)")
        health_status['identity_provider_ok'] = False
    except Exception as e:
        alert(f"IdP check failed: {e}")
        health_status['identity_provider_ok'] = False

9. Implement a 1-hour fallback: Degraded-Mode Authentication

When the identity provider fails, do not reject all requests. Instead, operate in degraded mode using cached tokens and permissions.

Implementation Pattern

import json
from datetime import datetime

import requests


class IdentityProviderException(Exception):
    """Raised when the identity provider cannot be reached or returns an error."""


class DegradedModeAuthenticator:
    """Fall back to cached tokens when identity provider is unavailable."""
    
    def __init__(self, cache):
        self.cache = cache  # Redis or similar
        self.provider_down_key = "identity_provider_down"
        self.retry_after = 30  # seconds to skip the provider after a failure
        self.fallback_ttl = 3600  # 1 hour fallback
    
    def is_provider_healthy(self):
        """Treat the provider as healthy unless a recent call to it failed."""
        # The down marker expires after retry_after seconds, so the provider
        # is retried automatically and recovery is detected without a restart
        return not self.cache.exists(self.provider_down_key)
    
    def authenticate_request(self, request):
        """
        Try to authenticate against identity provider.
        Fall back to cached token if provider is down.
        """
        
        # Step 1: Extract the token from the request
        token = extract_bearer_token(request)
        if not token:
            return None
        
        # Step 2: Try to validate against identity provider (primary path)
        if self.is_provider_healthy():
            try:
                user = self.validate_token_with_provider(token)
            except ValueError:
                # Token is invalid (not a provider issue): never fall back,
                # and drop any cached copy so an outage cannot revive it
                self.cache.delete(f"cached_token:{token}")
                return None
            except IdentityProviderException as e:
                # Timeout, connection error, or 5xx: switch to fallback
                log_warning(f"Identity provider unavailable: {e}")
                self.cache.set(self.provider_down_key, "1", ex=self.retry_after)
            else:
                # Cache the validated result (serialized, since Redis stores
                # strings) for use during an outage
                self.cache.set(f"cached_token:{token}", json.dumps({
                    'user': user,
                    'validated_at': datetime.now().isoformat()
                }), ex=self.fallback_ttl)
                return user
        
        # Step 3: Fall back to cached token validation
        raw = self.cache.get(f"cached_token:{token}")
        if raw:
            cached_result = json.loads(raw)
            user = cached_result['user']
            validated_at = datetime.fromisoformat(cached_result['validated_at'])
            age_seconds = (datetime.now() - validated_at).total_seconds()
            
            # Accept cached tokens up to 1 hour old during outage
            if age_seconds < self.fallback_ttl:
                log_info(f"Using cached token for user {user['id']} (age: {age_seconds}s)")
                
                # Mark this session as degraded
                user['degraded_mode'] = True
                return user
            else:
                log_warning(f"Cached token expired ({age_seconds}s old)")
        
        # Step 4: No valid token available
        return None
    
    def validate_token_with_provider(self, token):
        """
        Validate token against the actual identity provider.
        Raises IdentityProviderException if provider is unavailable.
        """
        try:
            response = requests.post(
                f"{IDENTITY_PROVIDER_URL}/validate",
                json={'token': token},
                timeout=2
            )
            
            if response.status_code == 200:
                return response.json()['user']
            elif response.status_code == 401:
                raise ValueError("Token invalid")
            else:
                raise IdentityProviderException(
                    f"Provider returned {response.status_code}"
                )
        
        except requests.Timeout:
            raise IdentityProviderException("Provider timeout")
        except requests.ConnectionError as e:
            raise IdentityProviderException(f"Provider unavailable: {e}")


# Attach to request handling
def authenticate_request(request):
    """Middleware that wraps requests with degraded-mode authentication."""
    authenticator = DegradedModeAuthenticator(cache=redis_client)
    user = authenticator.authenticate_request(request)
    
    if not user:
        raise Unauthorized("Authentication failed")
    
    request.user = user
    
    # Log degraded mode usage for alerting
    if user.get('degraded_mode'):
        metrics.increment('auth.degraded_mode_usage')

Recovery Validation Checklist

When the identity provider comes back online:

from itertools import islice

import requests
import schedule

def validate_identity_recovery(authenticator):
    """Verify that identity provider recovery is complete."""
    
    checks = {
        'provider_responding': False,
        'new_tokens_issued': False,
        'tokens_valid': False,
    }
    # 'degraded_mode_sessions_replaced' and 'no_orphaned_sessions' depend on
    # your session model; add them here once you can measure them. Left as
    # permanently False entries, they would block recovery forever.
    
    # Check 1: Provider is responding to health checks
    try:
        health = requests.get(
            f"{IDENTITY_PROVIDER_URL}/.well-known/openid-configuration",
            timeout=2
        )
        checks['provider_responding'] = health.status_code == 200
    except requests.RequestException:
        return checks  # Still down
    
    # Check 2: New tokens can be issued (use a dedicated test account and
    # load its credentials from your secret store, not from source code)
    try:
        token = request_new_token("test@internal", "test_password")
        checks['new_tokens_issued'] = token is not None
    except Exception as e:
        log_warning(f"Cannot issue new tokens: {e}")
    
    # Check 3: Existing tokens validate correctly (sample up to 10 keys;
    # scan_iter's count is only a batch-size hint, so cap it with islice)
    valid_count = 0
    for token_key in islice(redis_client.scan_iter("cached_token:*"), 10):
        if isinstance(token_key, bytes):
            token_key = token_key.decode()
        token = token_key.split(':', 1)[1]
        try:
            authenticator.validate_token_with_provider(token)
            valid_count += 1
        except Exception:
            pass
    
    checks['tokens_valid'] = valid_count > 0
    
    # Check 4: Clear degraded-mode session flag (users can refresh)
    # This happens automatically on next token refresh
    
    # Only declare recovery if ALL checks pass
    if all(checks.values()):
        log_info("Identity recovery validated. Resuming normal operation.")
    else:
        log_warning(f"Identity recovery incomplete: {checks}")
    return checks

# Call this periodically while provider is recovering.
# schedule only registers the job; a loop must call schedule.run_pending().
authenticator = DegradedModeAuthenticator(cache=redis_client)
schedule.every(10).seconds.do(validate_identity_recovery, authenticator)

10. Test identity failover before you need it

Do not wait for an outage to discover that your identity fallback does not work.

Gameday Scenario: Identity Provider Down

import time

def gameday_identity_provider_down():
    """Test: What happens when identity provider is offline?"""
    
    print("=== GAMEDAY: Identity Provider Down ===")
    
    # Step 1: Simulate provider unavailability
    mock_identity_provider.set_status("unavailable")
    
    # Step 2: Try normal user workflows.
    # Each test_* helper should return True when the observed behavior matches
    # the expected one. With only cached tokens to fall back on, a brand-new
    # login is expected to fail cleanly, not to succeed.
    results = {
        'login_new_user': test_login("new@example.com"),
        'refresh_token': test_token_refresh(),
        'access_protected_resource': test_access("user_data"),
        'api_calls_succeed': test_api_calls(100),
    }
    
    # Step 3: Check degraded mode kicked in
    metrics = get_current_metrics()
    results['degraded_mode_active'] = metrics['auth.degraded_mode_usage'] > 0
    
    # Step 4: Simulate provider recovery
    mock_identity_provider.set_status("healthy")
    
    # Step 5: Verify recovery works
    time.sleep(5)
    metrics = get_current_metrics()  # re-read: the earlier snapshot predates recovery
    results['provider_recovery_detected'] = metrics['identity_provider_health'] == 'ok'
    results['new_tokens_issued'] = test_new_token_after_recovery() is not None
    
    # Report
    print(f"Results: {results}")
    assert all(results.values()), "Identity failover failed"

Run this gameday:

Monthly (minimum)
Before major releases
When you change identity architecture

Relying on a single third-party identity source with no fallback is a structural choice. Make it intentionally, not by accident.

I work at Microsoft. The views expressed here are my own and based solely on publicly available information. This content is for educational purposes and does not represent official Microsoft guidance or commitments.