← Identity – The System Kill Switch | How You See (and Miss) Reality →
Your system is returning 200 OK.
Your error rate is 0.001%.
Your latencies are normal.
Your data is corrupted.
You will not know for hours, days, or weeks.
This is the silent failure: the system continues to operate, responding correctly to requests, while integrity degrades in the background. By the time you notice, the damage has compounded through dozens of downstream systems.
Silent failures are worse than loud failures
When your service crashes:
- Monitoring alerts immediately
- Users know something is wrong
- You can issue a status page update
- Recovery is visible and bounded
When your data is corrupted:
- The system returns success
- You cannot distinguish between “correct data returned” and “corrupted data returned”
- The corruption spreads to other systems that read the data
- You discover it when a human spots an anomaly (or worse, a customer does)
The classic patterns
Pattern 1: Partial writes
Your system writes to multiple destinations:
- Primary database
- Cache
- Search index
- Analytics pipeline
A request succeeds in destinations 1 and 2, fails in 3 and 4. Your application considers this “success” because the primary write worked.
The result: Your cache and search index are inconsistent with your database. Queries work fine until someone searches. Then they see outdated results, deleted items, or duplicates.
The detection problem: You monitor write latency and success rate. Both look fine. Your cache hit rate looks fine. Then one day someone notices their purchase is missing from the order history.
Pattern 2: Distributed transaction partial failure
You have a workflow:
- Debit account A
- Credit account B
- Log transaction
- Send notification
Step 3 or 4 fails. The money moved but no record exists (or the user got no notification).
In financial systems, this is a nightmare. In less critical systems, it still breaks user experience and creates support noise.
The detection problem: Your primary operations succeeded. Your failure rate shows 0.1%. Nobody knows that 0.1% of transactions are missing their audit trail.
Pattern 3: Replication lag masquerades as consistency
Your data is replicated to a secondary for resilience. During normal operation, replication lag is 100ms. That is fine.
Then something happens:
- Network partition
- Slow disk on replica
- GC pause
- Migration in progress
Replication lag jumps to 5 seconds. Then 30 seconds. Then your application times out waiting and falls back to the stale replica.
Requests start returning old data.
The detection problem: You monitor replication lag. But your threshold is “alert if > 60 seconds.” You do not alert at 30 seconds because that has never caused problems before. But for this workload, 30 seconds is catastrophic.
Pattern 4: Deletion cascades that do not cascade
You delete a user. Your application code is supposed to cascade:
- Delete user record
- Delete user sessions
- Delete user preferences
- Delete user audit logs
Step 1 succeeds. Steps 2-4 fail due to a dependency issue (session service is slow, preferences service is down).
The result: User is deleted but their ghost data remains. Orphaned sessions exist. When you reuse the user ID, you might pick up old sessions from the deleted user.
The detection problem: The deletion was reported as successful. Your orphaned data cleanup runs on a schedule (probably never) and does not cover this case.
Pattern 5: Idempotency boundaries are wrong
You designed your system to be idempotent: the same request can be retried safely.
However, your idempotency check is wrong:
- You check by (user_id, request_id) but the same request_id is reused across sessions
- You check by timestamp but clocks are not synchronized
- You check by content hash but the content has cosmetic differences
Now retries create duplicates instead of being safe.
The detection problem: Your request traces look correct. Retries are being issued as designed. You do not notice the duplicates until billing runs: customers are charged twice.
Pattern 6: Observability data loss masquerades as normal
Your observability pipeline has backpressure. When you are under load, logs and traces are dropped.
That is fine for debugging. It is not fine when the dropped data is the only evidence of corruption.
Real scenario: A data race causes incorrect behavior. It happens 0.01% of the time. Your observability system samples 10% of requests. The probability that the race is captured: extremely low. You will see error rate increase by 0.001% (0.01% * 10%) which falls below your alert threshold.
The detection problem: There is no log of the race happening. You see errors in production but cannot reproduce them. You eventually find the issue by code review, not by observability.
What you should be doing
1. Define your consistency model explicitly
For each data flow, answer:
- Is this transaction or eventually consistent?
- What is the maximum acceptable staleness?
- What happens if a write fails partway through?
- How do you recover from partial failures?
Most teams never answer these questions. They just hope.
2. Build consistency verification into operations
Schedule a job that:
- Reads from primary and replicas
- Compares the values
- Reports inconsistencies
- Attempts to repair (or raises a ticket)
This is not real-time consistency. It is eventual consistency verification.
3. Implement compensating transactions
For workflows that span multiple services:
- Plan the rollback path before you build the happy path
- Build the rollback as a first-class operation
- Test that rollback actually works
- Monitor for stuck transactions that got partially rolled back
4. Separate audit logging from operational logging
Your observability pipeline is for debugging. It can drop data under load.
Your audit log is for forensics. It must not drop data.
Use different systems:
- Observability (logs, traces): sampling OK, latency sensitive
- Audit (immutable log): sampling NOT OK, latency tolerant
5. Monitor for corruption indicators, not just error rates
Error rate 0.001%: normal. However:
- Cache hit rate dropped from 95% to 85%: investigate
- Search result count differs from database count: investigate
- Replication lag is now 5 seconds when it was 100ms: investigate
- User table row count does not match session table count: investigate
These are not errors. They are integrity warnings.
6. Build “consistency queries” into production monitoring
For critical data flows, run queries that verify integrity:
-- Check for orphaned records
SELECT COUNT(*) FROM sessions WHERE user_id NOT IN (SELECT id FROM users);
-- Check for stale cache
SELECT COUNT(*) FROM cache WHERE value != (
SELECT value FROM primary WHERE key = cache.key
);
-- Check for replication lag
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
AS replication_lag_seconds;
Run these periodically. Alert if counts are non-zero or lag exceeds threshold.
7. Test silent failures explicitly
In your chaos engineering, include:
- Write to primary succeeds, write to replica fails
- Primary returns data, replica returns stale version
- Cache has stale value while database is correct
- Partial write: first table succeeds, second table fails
- Replication lag exceeds your assumed threshold
Watch how your application behaves. Fix it before it is a production issue.
8. Implement read verification
For critical reads, verify the answer makes sense:
# Read from cache first
value = cache.get(key)
# If we got something, verify it matches source
if value is not None:
primary_value = primary.get(key)
if value != primary_value:
# Cache is stale, log incident
log_consistency_violation(key, cache=value, primary=primary_value)
# Return correct value
value = primary_value
This catches cache staleness in real time.
Detection Queries by Database Type
Theory without practice is a lecture. Here are the actual queries you run to detect silent failures.
Run these. Do not skip them because they look obvious. Silent failures look obvious in hindsight.
Tier 1: Run Every 5 Minutes (Most Critical)
These are the canaries for catastrophic data loss. Alert immediately if counts are non-zero.
PostgreSQL / SQL
-- Orphaned foreign key records (data exists in child but not parent)
SELECT tablename, orphaned_count
FROM (
SELECT 'orders_without_users' as tablename,
COUNT(*) as orphaned_count
FROM orders
WHERE user_id NOT IN (SELECT id FROM users)
UNION ALL
SELECT 'sessions_without_users' as tablename,
COUNT(*) as orphaned_count
FROM sessions
WHERE user_id NOT IN (SELECT id FROM users)
) counts
WHERE orphaned_count > 0;
-- Replication lag (primary to replica) in seconds
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
AS replication_lag_seconds;
-- Uncommitted transactions that are blocking others
SELECT pid, usename, query, state_change
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < now() - interval '5 minutes'
ORDER BY query_start DESC;
-- Table bloat (dead rows that slow queries)
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
ROUND(100 * (pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename))
/ pg_total_relation_size(schemaname||'.'||tablename)) AS bloat_pct
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
MySQL / MariaDB
-- Orphaned foreign key records
SELECT COUNT(*) as orphaned_orders
FROM orders o
LEFT JOIN users u ON o.user_id = u.id
WHERE u.id IS NULL;
-- Replication lag (seconds behind primary)
SHOW SLAVE STATUS\G
-- Look for: Seconds_Behind_Master
-- InnoDB redo log utilization
SELECT VARIABLE_VALUE
FROM information_schema.global_variables
WHERE variable_name = 'innodb_log_file_size';
-- Slow queries in the slow query log
SELECT count_star, sum_timer_wait, sql_text
FROM performance_schema.events_statements_summary_by_digest
ORDER BY count_star DESC
LIMIT 5;
DynamoDB (AWS SDK / CloudWatch)
# Check for write throttling
import boto3
cloudwatch = boto3.client('cloudwatch')
response = cloudwatch.get_metric_statistics(
Namespace='AWS/DynamoDB',
MetricName='WriteThrottleEvents',
StartTime=datetime.now() - timedelta(minutes=5),
EndTime=datetime.now(),
Period=60,
Statistics=['Sum']
)
throttle_events = sum(dp['Sum'] for dp in response['Datapoints'])
if throttle_events > 0:
print(f"ALERT: Write throttled {throttle_events} times in 5 minutes")
# Check for item collection size limit violations
# (happens when an item exceeds 10GB due to versioning)
response = cloudwatch.get_metric_statistics(
Namespace='AWS/DynamoDB',
MetricName='ItemCollectionSizeExceeded',
StartTime=datetime.now() - timedelta(minutes=5),
EndTime=datetime.now(),
Period=60,
Statistics=['Sum']
)
Redis
# Memory usage (if it hits max, writes silently fail)
redis-cli INFO memory | grep -E "used_memory|maxmemory|evicted_keys"
# Replication lag (if Redis has replicas)
redis-cli INFO replication | grep -E "role|slave|offset"
# Expired keys stuck in memory (if your expiration is not working)
redis-cli DBSIZE
redis-cli SCAN 0 TYPE string COUNT 1000 # Sample for TTLs
redis-cli TTL <sample_key> # -1 means no expiration
# Slow log (commands that are slow)
redis-cli SLOWLOG GET 10
Message Queues (SQS / RabbitMQ)
# AWS SQS: Check for messages that are not being consumed
import boto3
sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'
attrs = sqs.get_queue_attributes(
QueueUrl=queue_url,
AttributeNames=['ApproximateNumberOfMessages', 'ApproximateNumberOfMessagesNotVisible']
)
visible = int(attrs['Attributes']['ApproximateNumberOfMessages'])
in_flight = int(attrs['Attributes']['ApproximateNumberOfMessagesNotVisible'])
if visible > 1000 and in_flight < 10:
print(f"ALERT: Messages backing up. {visible} waiting, only {in_flight} in flight")
# Dead-letter queue (messages that failed and were retried too many times)
dlq_attrs = sqs.get_queue_attributes(
QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/my-queue-dlq',
AttributeNames=['ApproximateNumberOfMessages']
)
dlq_count = int(dlq_attrs['Attributes']['ApproximateNumberOfMessages'])
if dlq_count > 0:
print(f"ALERT: {dlq_count} messages in dead-letter queue")
Tier 2: Run Every Hour (Important Consistency Checks)
These catch subtle corruption that spreads slowly. Alerting threshold is usually non-zero (some corruption is normal after partial failures), but sudden increases are dangerous.
PostgreSQL
-- Check for rows with NULL in columns that should have values
SELECT tablename, column_name, null_count
FROM (
SELECT 'users' as tablename, 'email' as column_name,
COUNT(*) as null_count
FROM users WHERE email IS NULL
UNION ALL
SELECT 'orders' as tablename, 'total_amount' as column_name,
COUNT(*) as null_count
FROM orders WHERE total_amount IS NULL OR total_amount <= 0
) nulls
WHERE null_count > 0;
-- Check for duplicate unique keys (should never happen)
SELECT column_name, value, count(*) as duplicate_count
FROM (
SELECT email, email as value, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1
) duplicates
GROUP BY column_name, value;
-- Check for cascade delete failures
-- (user deleted but their orders remain)
SELECT COUNT(*) as orphaned_orders
FROM orders
WHERE user_id NOT IN (SELECT id FROM users);
-- Check for inconsistent totals
-- (order total != sum of line items)
SELECT COUNT(*) as broken_totals
FROM orders o
WHERE o.total_amount != (
SELECT COALESCE(SUM(unit_price * quantity), 0)
FROM order_items
WHERE order_id = o.id
);
MongoDB
# Check for missing indexes that are causing slow queries
import pymongo
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['production']
# Find collections with full table scans
for collection_name in db.list_collection_names():
stats = db.command('collStats', collection_name)
# If totalIndexSize is 0, no indexes exist
if stats.get('totalIndexSize', 0) == 0 and stats['count'] > 1000:
print(f"WARNING: {collection_name} has {stats['count']} documents but no indexes")
# Check for duplicate _id values (should be impossible)
for collection_name in db.list_collection_names():
collection = db[collection_name]
duplicates = collection.aggregate([
{'$group': {'_id': '$_id', 'count': {'$sum': 1}}},
{'$match': {'count': {'$gt': 1}}}
])
dup_count = sum(1 for _ in duplicates)
if dup_count > 0:
print(f"ALERT: {collection_name} has {dup_count} duplicate _id values")
DynamoDB
# Check for data validation anomalies using Athena queries
import boto3
athena = boto3.client('athena')
# Example: Check for orders with timestamps in the future
query = """
SELECT COUNT(*) as future_orders
FROM my_dynamodb_table
WHERE created_at > current_timestamp
"""
response = athena.start_query_execution(
QueryString=query,
QueryExecutionContext={'Database': 'my_database'},
ResultConfiguration={'OutputLocation': 's3://my-bucket/results/'}
)
Tier 3: Run Daily (Trending & Anomalies)
These are the slow checks that reveal long-term degradation. Run at low traffic times.
PostgreSQL / MySQL
-- Database growth rate (are you storing things you should not?)
SELECT
datname as database_name,
pg_size_pretty(pg_database_size(datname)) as size,
COUNT(*) * 8 / 1024.0 as size_change_mb_since_yesterday
FROM pg_database
WHERE datistemplate = false
ORDER BY pg_database_size(datname) DESC;
-- Table growth rate (which tables are bloating?)
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
ROUND(100.0 * (EXTRACT(EPOCH FROM (now() - last_vacuum)) / 86400)) as days_since_vacuum
FROM pg_tables
LEFT JOIN pg_stat_user_tables USING (tablename)
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
-- Backup success rate (are backups actually working?)
SELECT
backup_id,
database_name,
backup_time,
backup_size,
verification_status
FROM backups
WHERE backup_time > now() - interval '24 hours'
ORDER BY backup_time DESC;
Frequency Recommendations by Tier
| Tier | Frequency | Alert Threshold | Example |
|---|---|---|---|
| Tier 1 | Every 5 minutes | Non-zero (immediate escalate) | Replication lag, orphaned records, write throttling |
| Tier 2 | Every hour | Usually zero, alert on change | Duplicate keys, cascade failures, data validation |
| Tier 3 | Daily | Trend-based | Table growth, backup validation, slow queries |
Real Incident Mappings
Incident: GitHub Database Replication Lag (2017)
What happened: Replication lag drifted from 100ms to 8 seconds during a deploy. Application timed out and fell back to stale replica. Users saw 6-hour-old repos.
Detection query (Tier 1):
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
AS replication_lag_seconds;
Should have alerted when: Lag exceeded 2 seconds (not 60 seconds)
Resolution: Tuned alert threshold from 60s to 5s. Added canary queries that verify replica is fresh before using it.
Incident: Discord Cache Partitioning (2016)
What happened: Cache layer partitioned during network event. Half the servers cached one version of user data, half cached another. Users saw data flicker between two states.
Detection query (Tier 2):
# Sample cache entries and compare to database
for key in redis.scan(match='user:*', count=100):
cache_value = redis.get(key)
db_value = db.get(key)
if cache_value != db_value:
log_inconsistency(key, cache=cache_value, db=db_value)
Should have alerted when: 10+ inconsistencies detected in one minute
Resolution: Built cache verification into request path. On mismatch, invalidate cache and return fresh data.
Incident: Stripe Failed Cascade Delete (2014)
What happened: Deleted a customer. Billing records remained. New customer with same ID received old invoices.
Detection query (Tier 1):
SELECT COUNT(*) as orphaned_invoices
FROM invoices
WHERE customer_id NOT IN (SELECT id FROM customers);
Should have alerted when: Any orphaned invoices existed (should be zero)
Resolution: Made cascade deletes explicit. Built pre- and post-delete validation. Never rely on implicit cascades.
Response Procedures: Detect → Isolate → Reconcile → Validate
When a consistency query triggers:
1. Detect (Now)
- Query confirmed corruption exists
- Example: 47 orphaned records found
- Alert: Yes, escalate immediately
2. Isolate (0-5 minutes)
- Stop writes to affected tables (if safe)
- Example: Set orders table to read-only
- Prevent corruption from spreading
- Notify team: “Table X is read-only during reconciliation”
3. Reconcile (5-60 minutes)
- Repair the corruption
- Manual examples:
-- Delete orphaned records DELETE FROM orders WHERE user_id NOT IN (SELECT id FROM users); -- Recalculate totals UPDATE orders SET total = (SELECT SUM(price) FROM order_items WHERE order_id = orders.id); - Validate that repair succeeded
4. Validate (Before resuming)
- Re-run the detection query (should show zero)
- Sample queries to verify data looks correct
- Resume writes only after validation passes
- Postmortem: Why did the corruption happen?
Threshold Tuning: When to Alert
Anti-pattern: “Alert when count > 0”
- Triggers too often for inevitable small corruptions
- Teams create alert suppression rules (defeats the purpose)
Better pattern: Alert when count is abnormal for your system
# Pseudocode: Smart threshold
orphaned_records = run_query("SELECT COUNT(*) orphaned FROM orders...")
# Baseline: What is normal for this system?
# Most systems have occasional orphaned records from:
# - Cascade delete timing windows
# - Partial failure recovery
baseline = 0 # or 1 or 5, depends on your system
if orphaned_records > baseline * 10:
# Major corruption, alert immediately
alert("CRITICAL: Orphaned record count {orphaned_records}, baseline {baseline}")
elif orphaned_records > baseline + 1 and orphaned_records > baseline * 1.5:
# Trending toward corruption, investigate
alert("WARNING: Orphaned count increased from {baseline} to {orphaned_records}")
else:
# Normal
pass
Automation: Running Queries in Production
Option 1: Scheduled job (recommended)
import schedule
import time
def run_consistency_checks():
results = {
'timestamp': datetime.now(),
'checks': []
}
# Tier 1 every 5 minutes
results['checks'].append(check_replication_lag())
results['checks'].append(check_orphaned_records())
results['checks'].append(check_write_throttling())
# Tier 2 every hour
if datetime.now().minute == 0:
results['checks'].append(check_duplicate_keys())
results['checks'].append(check_cascade_deletes())
# Tier 3 daily at 2am
if datetime.now().hour == 2 and datetime.now().minute == 0:
results['checks'].append(check_table_bloat())
results['checks'].append(check_backup_success())
# Send results to observability system
observability.send(results)
# Alert on any failures
for check in results['checks']:
if check['status'] == 'FAILED':
alert.send(check)
schedule.every(5).minutes.do(run_consistency_checks)
while True:
schedule.run_pending()
time.sleep(1)
Option 2: Observability system SQL job
- Set up recurring SQL jobs in your observability tool (e.g., DataDog, New Relic, Datadog)
- They run the queries on schedule and alert on anomalies
- Easier than maintaining code, but less flexible
Option 3: Database-native jobs
- PostgreSQL: pg_cron extension
- MySQL: Event scheduler
- Less flexible, but no external dependencies
The uncomfortable truth
The systems that fail silently are the ones you are confident about.
You are confident because:
- They have been running for months
- Your error rate is low
- Your monitoring shows green
- You have not had an incident in 6 months
That confidence is exactly when data corruption is most likely to go unnoticed.
By the time you detect it, it has propagated through three systems and your customers have discovered it in their reports.
Key architecture principle
Silent failures require different monitoring than loud failures.
Loud failures (crashes, timeouts, 500 errors) show up in error rate.
Silent failures (data corruption, inconsistency, stale reads) do not. They show up in:
- Consistency queries
- Data verification
- Comparison between systems
- Human-observed anomalies
If you are only monitoring error rate, you are not monitoring for the worst failures.
Chapter index
| Chapter | Topic |
|---|---|
| Chapter 1 | Opening thesis: reliability as economic decision |
| Chapter 2 | Incentives and organizational failure |
| Chapter 3 | The things that actually break |
| Shared Responsibility | Shared responsibility and accountability vacuum |
| Chapter 4 | The financial model |
| Chapter 5 | Provider failures and status page reality |
| Chapter 6 | Partial failures and degraded-state design |
| Chapter 5 (Alt) | Identity as a Tier-0 failure domain |
| Chapter 6 (Alt) | Silent outages and data corruption |
| Chapter 7 | Hidden cost of observability tooling |
| Chapter 8 | Trade-offs: on-call, FinOps, and human cost |
| Chapter 9 | Governance system |
| Chapter 10 | Execution and the next quarter |
| Chapter 12 | Reliability pricing and the SaaS margin trap |
| Appendix | Operating artifacts and policy templates |
| Chapter 13 | Maturity and organizational adoption |
I work at Microsoft. The views expressed here are my own and based solely on publicly available information. This content is for educational purposes and does not represent official Microsoft guidance or commitments.