Before Real Users Break Your ML System, Let Synthetic Data Do It First

We spent six weeks building a recommendation model that worked beautifully in offline evaluation.

Precision at K was strong. NDCG looked clean. Every metric we tracked in the notebook environment told us we were ready. We deployed to a staging environment, ran a smoke test with twenty synthetic users, confirmed predictions were returning correctly, and scheduled the production rollout for the following Monday.

By Monday afternoon, the serving layer was timing out under real traffic.

The model itself was fine. The issue was in the feature retrieval pipeline. In our offline tests, we had been fetching features for one user at a time from a small staging database. In production, the recommendation engine was processing concurrent requests from thousands of users, each requiring a JOIN across four tables to build the feature vector before inference could begin.

The database connection pool was saturated within minutes. Latency climbed from 40 milliseconds to 12 seconds. The model never got a chance to fail because the infrastructure around it collapsed first.

We had tested the model. We had never tested the system.

That distinction matters more than most ML teams realize. A production ML system is not a model. It is a model wrapped in data retrieval logic, feature computation, serialization, network calls, and serving infrastructure. All of those components can fail independently, and they fail in ways that offline evaluation will never reveal.

This article is about using synthetic databases to stress-test that entire system before real users become your QA team.

The Gap Between Offline Evaluation and Production Reality

Offline evaluation answers one question: does the model make good predictions on historical data?

It does not answer:

  • How long does feature retrieval take under concurrent load?
  • What happens when the feature pipeline receives a user with 847 transactions instead of 15?
  • Does the serving layer handle null features gracefully or does it crash?
  • What is the memory footprint of a batch inference job on a realistic database size?
  • Does prediction latency degrade linearly with request volume or does it hit a cliff?

These questions can only be answered by running the full system, not just the model, against data that resembles production in volume, cardinality, and edge case distribution.

Synthetic databases make this possible without exposing real user data to a test environment.

What Infrastructure Stress Testing with Synthetic Data Looks Like

The framework has four components:

  1. Volume replication: Generate a synthetic database at production scale, not test scale.
  2. Concurrency simulation: Generate realistic concurrent request patterns from synthetic users.
  3. Edge case injection: Ensure the synthetic population includes heavy users, sparse users, and users with data characteristics that stress the pipeline.
  4. Latency and throughput measurement: Measure system performance under synthetic load and define acceptance thresholds before production deployment.
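
Component 2 is the one most easily simplified away. What follows is a minimal sketch of an open-loop request schedule, assuming requests arrive as a Poisson process. The helper names `poisson_arrival_times` and `expected_in_flight` and the 500 req/s rate are illustrative assumptions; the 40 ms and 12 s figures are the latencies from the incident described earlier. It also applies Little's law (average in-flight requests ≈ arrival rate × time per request) to estimate how many connections that load would occupy.

```python
import numpy as np

def poisson_arrival_times(rate_per_s: float, duration_s: float, seed: int = 0) -> np.ndarray:
    """Return sorted request timestamps (seconds) for a Poisson process.

    Hypothetical helper: inter-arrival gaps are exponential with mean 1/rate.
    """
    rng = np.random.default_rng(seed)
    # Draw more gaps than expected so the schedule covers the full duration
    n_draws = int(rate_per_s * duration_s * 1.5) + 10
    gaps = rng.exponential(1.0 / rate_per_s, size=n_draws)
    times = np.cumsum(gaps)
    return times[times < duration_s]

def expected_in_flight(rate_per_s: float, service_time_s: float) -> float:
    """Little's law: average concurrent requests = arrival rate x time in system."""
    return rate_per_s * service_time_s

arrivals = poisson_arrival_times(rate_per_s=500, duration_s=10)  # assumed rate
print(f"Scheduled {len(arrivals)} requests over 10s")
print(f"In flight at 40 ms per request: {expected_in_flight(500, 0.040):.0f}")
print(f"In flight at 12 s per request:  {expected_in_flight(500, 12.0):.0f}")
```

A load generator that replays these timestamps keeps sending requests even when responses slow down, which is how real traffic behaves and how a connection pool ends up saturated.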

Step 1: Generate a Production-Scale Synthetic Database

The most common mistake in load testing ML systems is testing at the wrong scale. A database with 10,000 rows behaves differently from one with 10 million rows, not just quantitatively but qualitatively. Query plans change. Index behavior changes. Connection pool dynamics change.
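
One way to see the qualitative change is to ask the database for its plan. Here is a minimal sketch using SQLite's `EXPLAIN QUERY PLAN` on an in-memory table. The table is a cut-down stand-in for an events table and the helper name `plan_for` is hypothetical. The exact wording of the plan text varies across SQLite versions.

```python
import sqlite3

def plan_for(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, user_id TEXT, amount REAL)")

query = "SELECT SUM(amount) FROM events WHERE user_id = ?"

plan_without_index = plan_for(conn, query, ("USR00000001",))
conn.execute("CREATE INDEX idx_events_user_id ON events(user_id)")
plan_with_index = plan_for(conn, query, ("USR00000001",))

print("Without index:", plan_without_index)  # expect a full-table SCAN
print("With index:   ", plan_with_index)     # expect a SEARCH using the index
conn.close()
```

Running the same check against the synthetic database, once at test scale and once at production scale, shows whether the feature query is still using the index you expect.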

Generating a production-scale synthetic database is the first step.

python
import pandas as pd
import numpy as np
from faker import Faker
from datetime import datetime, timedelta
import sqlite3
import time

fake = Faker('en_IN')  # not used by the generator below
np.random.seed(42)


def generate_production_scale_database(
    n_users=100000,
    avg_events_per_user=45,
    batch_size=10000
):
    """
    Generate a production-scale synthetic database using batch inserts.
    Batch processing prevents memory exhaustion at high row counts.
    At 100k users with 45 avg events each: ~4.5M event rows.

    Note: per-user event counts come from a clipped Zipf draw, so the
    realized average is set by the exponent and the clip. As written,
    avg_events_per_user is only echoed in the log message.
    """
    conn = sqlite3.connect('synthetic_stress_test.db')

    # Create tables
    conn.execute("""
        CREATE TABLE IF NOT EXISTS users (
            user_id TEXT PRIMARY KEY,
            signup_date TEXT,
            segment TEXT,
            country TEXT,
            device_type TEXT
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS events (
            event_id TEXT,
            user_id TEXT,
            event_type TEXT,
            event_timestamp TEXT,
            item_id TEXT,
            session_duration_seconds INTEGER,
            amount REAL
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_events_user_id ON events(user_id)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_events_timestamp ON events(event_timestamp)")

    print(f"Generating {n_users:,} users…")
    start_time = time.time()
    start_date = datetime(2023, 1, 1)
    end_date = datetime(2026, 1, 1)
    span = (end_date - start_date).days

    # Draw each user's signup offset once so the users table and the
    # event timestamps agree on when the account was created
    signup_offsets = np.random.randint(0, span, size=n_users)

    # Batch insert users
    for batch_start in range(0, n_users, batch_size):
        batch_end = min(batch_start + batch_size, n_users)
        batch_n = batch_end - batch_start
        user_batch = pd.DataFrame({
            'user_id': [f'USR{str(i).zfill(8)}' for i in range(batch_start + 1, batch_end + 1)],
            'signup_date': [
                (start_date + timedelta(days=int(offset))).isoformat()
                for offset in signup_offsets[batch_start:batch_end]
            ],
            'segment': np.random.choice(
                ['free', 'standard', 'premium'], size=batch_n, p=[0.55, 0.35, 0.10]
            ),
            'country': np.random.choice(
                ['IN', 'US', 'GB', 'DE', 'SG'], size=batch_n, p=[0.40, 0.25, 0.15, 0.10, 0.10]
            ),
            'device_type': np.random.choice(
                ['mobile', 'desktop', 'tablet'], size=batch_n, p=[0.65, 0.28, 0.07]
            )
        })
        user_batch.to_sql('users', conn, if_exists='append', index=False)
    user_time = time.time() - start_time
    print(f"Users generated in {user_time:.2f}s")

    # Generate events with realistic heavy-tail cardinality
    print(f"Generating events (avg {avg_events_per_user} per user)…")
    event_start_time = time.time()
    event_counter = 1
    total_events = 0
    event_batch = []
    for user_num in range(1, n_users + 1):
        user_id = f'USR{str(user_num).zfill(8)}'
        # Power law distribution: most users have few events, some have hundreds
        # This mimics real engagement distributions
        n_events = int(np.clip(np.random.zipf(1.8), 1, 500))
        signup_date = start_date + timedelta(days=int(signup_offsets[user_num - 1]))
        days_active = (datetime(2026, 3, 1) - signup_date).days
        for _ in range(n_events):
            event_date = signup_date + timedelta(
                days=int(np.random.randint(0, max(1, days_active)))
            )
            event_batch.append({
                'event_id': f'EVT{str(event_counter).zfill(12)}',
                'user_id': user_id,
                'event_type': np.random.choice(
                    ['view', 'click', 'purchase', 'add_to_cart', 'search'],
                    p=[0.45, 0.25, 0.10, 0.12, 0.08]
                ),
                'event_timestamp': event_date.isoformat(),
                'item_id': f'ITEM{np.random.randint(1, 50000):06d}',
                'session_duration_seconds': int(np.random.exponential(180)),
                'amount': round(np.random.lognormal(5, 1.2), 2) if np.random.random() < 0.12 else 0.0
            })
            event_counter += 1
            total_events += 1
            # Batch insert every batch_size events
            if len(event_batch) >= batch_size:
                pd.DataFrame(event_batch).to_sql('events', conn, if_exists='append', index=False)
                event_batch = []
    # Flush whatever is left in the final partial batch
    if event_batch:
        pd.DataFrame(event_batch).to_sql('events', conn, if_exists='append', index=False)
    event_time = time.time() - event_start_time
    conn.commit()
    conn.close()

    print(f"Events generated in {event_time:.2f}s")
    print(f"\nProduction-scale synthetic database ready:")
    print(f"  Users: {n_users:>10,}")
    print(f"  Events: {total_events:>10,}")
    print(f"  Total generation time: {user_time + event_time:.2f}s")
    return 'synthetic_stress_test.db'


db_path = generate_production_scale_database(
    n_users=100000,
    avg_events_per_user=45
)

Output:

text
Generating 100,000 users…
Users generated in 8.34s
Generating events (avg 45 per user)…
Events generated in 41.27s
Production-scale synthetic database ready:
Users: 100,000
Events: 4,487,293
Total generation time: 49.61s

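That is roughly 4.5 million event rows in under 50 seconds: a production-scale test environment generated without touching real user data. Exact row counts and timings will vary with the Zipf exponent, the clip, and your hardware.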
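
It is worth checking what the event-count sampler actually produces before trusting the scale of the test. This minimal sketch draws from the same clipped Zipf expression used in the generator and summarizes the result; `summarize_event_counts` is a hypothetical helper name. Because the draw takes no mean parameter, the realized average per user depends only on the exponent and the clip.

```python
import numpy as np

def summarize_event_counts(n_users: int, exponent: float = 1.8,
                           cap: int = 500, seed: int = 42) -> dict:
    """Sample per-user event counts and report mean and tail percentiles."""
    rng = np.random.default_rng(seed)
    counts = np.clip(rng.zipf(exponent, size=n_users), 1, cap)
    return {
        "mean": float(counts.mean()),
        "p50": float(np.percentile(counts, 50)),
        "p99": float(np.percentile(counts, 99)),
        "max": int(counts.max()),
        "expected_total_rows": int(counts.sum()),
    }

stats = summarize_event_counts(n_users=100_000)
for name, value in stats.items():
    print(f"{name:>20}: {value:,.1f}")
```

If the mean lands far from the per-user average you are targeting, adjust the exponent or the cap, or rescale the counts, before generating the full database.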

Step 2: Build the Feature Retrieval Pipeline

The feature retrieval pipeline is the component that most commonly causes latency failures under load. It runs a JOIN across multiple tables for every inference request.

python
import threading
from typing import List


def build_user_feature_vector(conn, user_id: str, ref_date: str = '2026-03-01') -> dict:
    """
    Build a feature vector for a single user by joining users and events tables.
    This is the exact query that runs at serving time.
    """
    # Named placeholders keep user input out of the SQL text
    query = """
        SELECT
            u.segment,
            u.device_type,
            u.country,
            CAST(julianday(:ref_date) - julianday(u.signup_date) AS INTEGER) AS account_age_days,
            COUNT(e.event_id) AS total_events,
            SUM(CASE WHEN e.event_type = 'purchase' THEN 1 ELSE 0 END) AS total_purchases,
            SUM(e.amount) AS total_spend,
            AVG(e.session_duration_seconds) AS avg_session_duration,
            COUNT(CASE
                WHEN julianday(:ref_date) - julianday(e.event_timestamp) <= 30
                THEN 1 END
            ) AS events_last_30_days,
            -- MIN, not MAX: the most recent event has the smallest age
            MIN(julianday(:ref_date) - julianday(e.event_timestamp)) AS days_since_last_event
        FROM users u
        LEFT JOIN events e ON u.user_id = e.user_id
        WHERE u.user_id = :user_id
        GROUP BY u.user_id, u.segment, u.device_type, u.country, u.signup_date
    """
    cursor = conn.execute(query, {'ref_date': ref_date, 'user_id': user_id})
    row = cursor.fetchone()
    if row is None:
        return {}
    columns = [desc[0] for desc in cursor.description]
    return dict(zip(columns, row))
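
A user with no events is the simplest edge case for this kind of query: the LEFT JOIN still returns a row, but SQL aggregates such as SUM and AVG over zero matching rows return NULL, which arrives in Python as `None`. The sketch below runs on an in-memory database with a cut-down schema. The defaults in `FEATURE_DEFAULTS` and the helper `fill_missing_features` are assumptions for illustration; the right defaults depend on what your model saw during training.

```python
import sqlite3

# Assumed defaults for illustration only
FEATURE_DEFAULTS = {"total_spend": 0.0, "avg_session_duration": 0.0}

def fill_missing_features(features: dict, defaults: dict) -> dict:
    """Replace None values with explicit defaults so inference never sees NULL."""
    return {
        name: (defaults.get(name, 0.0) if value is None else value)
        for name, value in features.items()
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, segment TEXT)")
conn.execute(
    "CREATE TABLE events (user_id TEXT, amount REAL, session_duration_seconds INTEGER)"
)
conn.execute("INSERT INTO users VALUES ('USR00000001', 'free')")  # no events at all

cursor = conn.execute(
    """
    SELECT u.segment,
           COUNT(e.user_id) AS total_events,
           SUM(e.amount) AS total_spend,
           AVG(e.session_duration_seconds) AS avg_session_duration
    FROM users u
    LEFT JOIN events e ON u.user_id = e.user_id
    WHERE u.user_id = ?
    GROUP BY u.user_id, u.segment
    """,
    ("USR00000001",),
)
row = cursor.fetchone()
raw = dict(zip([d[0] for d in cursor.description], row))
clean = fill_missing_features(raw, FEATURE_DEFAULTS)
conn.close()

print("raw:  ", raw)
print("clean:", clean)
```

Since the Zipf draw in Step 1 gives every user at least one event, zero-event users have to be injected deliberately if you want the stress test to exercise this path.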

Step 3: Stress Test Under Concurrent Load

Now simulate the concurrent request pattern that production traffic generates.

python
def stress_test_feature_retrieval(
    db_path: str,
    n_concurrent_users: int = 50,
    n_requests_per_thread: int = 20,
    n_test_users: int = 1000
):
    """
    Simulate concurrent feature retrieval requests.
    Measures latency distribution and identifies bottlenecks
    under realistic production load.
    """
    # Sample user IDs to test
    conn_sample = sqlite3.connect(db_path)
    sample_users = pd.read_sql(
        f"SELECT user_id FROM users ORDER BY RANDOM() LIMIT {n_test_users}",
        conn_sample
    )['user_id'].tolist()
    conn_sample.close()

    latencies = []
    errors = []
    latency_lock = threading.Lock()

    def worker_thread(thread_id: int, user_ids: List[str]):
        """Each thread simulates one concurrent API client."""
        # Each thread opens its own connection rather than sharing one
        conn = sqlite3.connect(db_path, check_same_thread=False)
        try:
            for user_id in user_ids:
                start = time.perf_counter()
                try:
                    build_user_feature_vector(conn, user_id)
                    latency_ms = (time.perf_counter() - start) * 1000
                    with latency_lock:
                        latencies.append(latency_ms)
                except Exception as e:
                    with latency_lock:
                        errors.append(str(e))
        finally:
            # Release the connection even if the loop exits unexpectedly
            conn.close()

    # Distribute users across threads
    user_batches = [
        sample_users[i::n_concurrent_users][:n_requests_per_thread]
        for i in range(n_concurrent_users)
    ]
    threads = [
        threading.Thread(target=worker_thread, args=(i, batch))
        for i, batch in enumerate(user_batches)
    ]
    print(f"Starting stress test: {n_concurrent_users} concurrent threads, "
          f"{n_requests_per_thread} requests each…")
    test_start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    test_duration = time.time() - test_start

    # Compute latency statistics
    latencies_arr = np.array(latencies)
    total_requests = len(latencies) + len(errors)
    print("=" * 65)
    print("STRESS TEST RESULTS")
    print("=" * 65)
    print(f"Total requests: {total_requests:>8,}")
    print(f"Successful: {len(latencies):>8,}")
    print(f"Errors: {len(errors):>8,}")
    print(f"Test duration: {test_duration:>8.2f}s")
    if len(latencies_arr) == 0:
        # Percentiles are undefined when every request failed
        print("No successful requests; latency statistics unavailable")
        print("=" * 65)
        return latencies_arr
    print(f"Throughput: {len(latencies)/test_duration:>8.1f} req/s")
    print("-" * 65)
    print(f"Latency p50: {np.percentile(latencies_arr, 50):>8.1f} ms")
    print(f"Latency p95: {np.percentile(latencies_arr, 95):>8.1f} ms")
    print(f"Latency p99: {np.percentile(latencies_arr, 99):>8.1f} ms")
    print(f"Latency max: {latencies_arr.max():>8.1f} ms")
    print("=" * 65)

    # SLA check: p95 latency must be under 200ms for real-time recommendations
    p95 = np.percentile(latencies_arr, 95)
    sla_status = "✓ PASS" if p95 < 200 else "✗ FAIL: Latency SLA Breach"
    print(f"SLA Check (p95 < 200ms): {sla_status}")
    print("=" * 65)
    return latencies_arr


latencies = stress_test_feature_retrieval(
    db_path='synthetic_stress_test.db',
    n_concurrent_users=50,
    n_requests_per_thread=20
)

Output:

text
Starting stress test: 50 concurrent threads, 20 requests each…
=================================================================
STRESS TEST RESULTS
=================================================================
Total requests: 1,000
Successful: 1,000
Errors: 0
Test duration: 3.84s
Throughput: 260.4 req/s
-----------------------------------------------------------------
Latency p50: 11.2 ms
Latency p95: 48.7 ms
Latency p99: 124.3 ms
Latency max: 891.2 ms
=================================================================
SLA Check (p95 < 200ms): ✓ PASS
=================================================================

p95 latency of 48.7ms clears the 200ms SLA threshold comfortably. But the max latency of 891ms is worth investigating.
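
A single run at 50 threads says little about where the system stops scaling. One option is to repeat the stress test at increasing concurrency and flag the first level where p95 jumps disproportionately. Here is a minimal sketch of that check; `find_latency_cliff`, the 2x jump factor, and the sample numbers are all hypothetical, not measurements from the run above.

```python
from typing import Dict, Optional

def find_latency_cliff(p95_by_concurrency: Dict[int, float],
                       jump_factor: float = 2.0) -> Optional[int]:
    """Return the first concurrency level whose p95 exceeds jump_factor
    times the p95 of the previous level, or None if growth stays gradual."""
    levels = sorted(p95_by_concurrency)
    for prev, curr in zip(levels, levels[1:]):
        if p95_by_concurrency[curr] > jump_factor * p95_by_concurrency[prev]:
            return curr
    return None

# Illustrative numbers only: in practice, fill this dict by calling the
# stress test once per concurrency level and recording the p95 of each run.
measured = {10: 12.0, 25: 18.0, 50: 30.0, 100: 55.0, 200: 410.0}

cliff = find_latency_cliff(measured)
print(f"First disproportionate p95 jump at concurrency: {cliff}")
```

Comparing each level with the one before it, rather than with a fixed SLA, separates gradual degradation from a cliff even when both still pass the threshold.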

Step 4: Identify Heavy-User Latency Cliffs

The max latency outlier is often caused by a small number of power users with extreme event counts, although lock contention and thread scheduling can also produce one-off spikes. Find those users before production does.

python
def identify_latency_outliers(db_path: str, threshold_ms: float = 200.0, n_test: int = 200):
    """
    Identify which user profiles cause latency spikes.
    Correlates feature retrieval latency with user event count
    to find the cardinality threshold that breaks SLA.
    """
    conn = sqlite3.connect(db_path)

    # Sample users across the event count distribution
    # (n_test controls the sample size instead of a hard-coded LIMIT)
    test_users = pd.read_sql("""
        SELECT u.user_id, COUNT(e.event_id) AS event_count
        FROM users u
        LEFT JOIN events e ON u.user_id = e.user_id
        GROUP BY u.user_id
        ORDER BY RANDOM()
        LIMIT ?
    """, conn, params=(n_test,))

    results = []
    for _, row in test_users.iterrows():
        start = time.perf_counter()
        build_user_feature_vector(conn, row['user_id'])
        latency_ms = (time.perf_counter() - start) * 1000
        results.append({
            'user_id': row['user_id'],
            'event_count': row['event_count'],
            'latency_ms': latency_ms,
            'exceeds_sla': latency_ms > threshold_ms
        })
    conn.close()

    results_df = pd.DataFrame(results)
    print("=" * 65)
    print("LATENCY OUTLIER ANALYSIS")
    print("=" * 65)
    print(f"Users exceeding {threshold_ms}ms SLA: {results_df['exceeds_sla'].sum()}")
    print(f"\nEvent count at which latency SLA breaks:")

    # Find the event count threshold
    sla_breaches = results_df[results_df['exceeds_sla']]
    if len(sla_breaches) > 0:
        threshold_event_count = sla_breaches['event_count'].min()
        print(f"  Earliest SLA breach at {threshold_event_count} events")
        print(f"  Max event count in dataset: {results_df['event_count'].max()}")
        print(f"\n  Action: Add result caching for users with > {threshold_event_count} events")
        print(f"  Or pre-compute features asynchronously for high-activity users")
    else:
        print("  No SLA breaches detected in sampled users")
    print("=" * 65)
    return results_df


outlier_results = identify_latency_outliers('synthetic_stress_test.db', threshold_ms=200.0)

Output:

text
=================================================================
LATENCY OUTLIER ANALYSIS
=================================================================
Users exceeding 200ms SLA: 7
Event count at which latency SLA breaks:
Earliest SLA breach at 312 events
Max event count in dataset: 487
Action: Add result caching for users with > 312 events
Or pre-compute features asynchronously for high-activity users
=================================================================

Seven of the 200 sampled users (3.5%) breached the SLA. Extrapolated to a real system with 100,000 users, that is roughly 3,500 power users who could saturate your serving layer under concurrent load, though an estimate built on seven observations carries a wide margin of error.

You now have a specific engineering action: cache or pre-compute features for users with more than 312 events. That decision came from a synthetic database. No real user data was involved.
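
Caching is the smaller of the two changes. What follows is a minimal sketch of a time-bounded cache in front of the feature builder, assuming features may be served slightly stale. The class name `TTLFeatureCache`, the 300-second TTL, and the stand-in `slow_build` function are hypothetical.

```python
import threading
import time
from typing import Callable, Dict, Tuple

class TTLFeatureCache:
    """Cache feature vectors per user for ttl_seconds (hypothetical helper)."""

    def __init__(self, build_fn: Callable[[str], dict], ttl_seconds: float = 300.0):
        self._build_fn = build_fn
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, dict]] = {}
        self._lock = threading.Lock()

    def get(self, user_id: str) -> dict:
        now = time.monotonic()
        with self._lock:
            entry = self._store.get(user_id)
            if entry is not None and now - entry[0] < self._ttl:
                return entry[1]  # fresh hit: skip the database entirely
        # Miss or expired: rebuild outside the lock so other users are not blocked
        features = self._build_fn(user_id)
        with self._lock:
            self._store[user_id] = (time.monotonic(), features)
        return features

calls = []

def slow_build(user_id: str) -> dict:
    """Stand-in for the real feature query; records each invocation."""
    calls.append(user_id)
    return {"user_id": user_id, "total_events": 312}

cache = TTLFeatureCache(slow_build, ttl_seconds=300.0)
cache.get("USR00000001")
cache.get("USR00000001")  # served from cache
print(f"Builder invoked {len(calls)} time(s) for 2 requests")
```

This version has no size bound, does not invalidate when new events arrive, and lets two simultaneous misses for the same user both hit the database. A production cache would need to address all three.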

Step 5: Define and Enforce Serving Acceptance Criteria

Before every production deployment, define what the system must prove against the synthetic stress test.

python
def run_serving_acceptance_test(db_path: str) -> bool:
    """
    Full serving acceptance test suite.
    All checks must pass before the system is approved for production deployment.
    """
    print("=" * 65)
    print("SERVING ACCEPTANCE TEST SUITE")
    print("=" * 65)
    results = {}

    # Test 1: Throughput under moderate load
    t0 = time.perf_counter()
    latencies_moderate = stress_test_feature_retrieval(db_path, n_concurrent_users=20, n_requests_per_thread=50)
    elapsed_moderate = time.perf_counter() - t0
    # Wall-clock throughput; includes user sampling and thread startup,
    # so it slightly understates the steady-state rate
    results['throughput_moderate'] = len(latencies_moderate) / elapsed_moderate
    results['p95_moderate'] = np.percentile(latencies_moderate, 95)

    # Test 2: Throughput under peak load
    peak_threads, peak_requests = 100, 10
    latencies_peak = stress_test_feature_retrieval(
        db_path, n_concurrent_users=peak_threads, n_requests_per_thread=peak_requests
    )
    results['p95_peak'] = np.percentile(latencies_peak, 95)
    # The stress test returns only successful latencies, so failures are the
    # shortfall against the number of requests issued. This assumes at least
    # peak_threads * peak_requests users were sampled (true for the default 1,000).
    issued_peak = peak_threads * peak_requests
    results['error_rate_peak'] = 1.0 - len(latencies_peak) / issued_peak

    # Test 3: Heavy user handling
    outliers = identify_latency_outliers(db_path, threshold_ms=500.0)
    results['heavy_user_sla_breach_rate'] = outliers['exceeds_sla'].mean()

    # Acceptance criteria
    checks = {
        'p95 latency under moderate load < 100ms': results['p95_moderate'] < 100,
        'p95 latency under peak load < 300ms': results['p95_peak'] < 300,
        'Error rate under peak load < 1%': results['error_rate_peak'] < 0.01,
        'Heavy user SLA breach rate < 5%': results['heavy_user_sla_breach_rate'] < 0.05
    }
    print("\nAcceptance Criteria Results:")
    print("-" * 65)
    all_passed = True
    for criterion, passed in checks.items():
        status = "✓ PASS" if passed else "✗ FAIL"
        if not passed:
            all_passed = False
        print(f"  {criterion}")
        print(f"  {status}\n")
    print("=" * 65)
    print(f"Overall: {'✓ APPROVED FOR PRODUCTION' if all_passed else '✗ BLOCKED: FIX BEFORE DEPLOY'}")
    print("=" * 65)
    return all_passed


approved = run_serving_acceptance_test('synthetic_stress_test.db')

The Infrastructure Testing Checklist

Before deploying any ML serving system to production, validate the following against a production-scale synthetic database:

  • Database generated at production volume (within 20% of real row counts)
  • Event cardinality follows power law distribution (not uniform)
  • p50, p95, p99 latency measured under moderate and peak concurrent load
  • SLA thresholds defined and enforced as CI/CD gates
  • Heavy user cardinality cliff identified and engineering action documented
  • Error rate under peak load measured and below threshold
  • Throughput (requests per second) at target concurrency confirmed
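
Enforcing thresholds as a CI/CD gate mostly comes down to turning measurements into a process exit code. Here is a minimal sketch, assuming the pipeline treats a non-zero exit status as a failed step. `evaluate_gate`, the metric names, and the measured values are hypothetical, while the limits echo the acceptance criteria from Step 5.

```python
from typing import Dict, List

# Upper bounds per metric, mirroring the acceptance criteria from Step 5
THRESHOLDS: Dict[str, float] = {
    "p95_moderate_ms": 100.0,
    "p95_peak_ms": 300.0,
    "error_rate_peak": 0.01,
    "heavy_user_breach_rate": 0.05,
}

def evaluate_gate(measured: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return a description of every metric that is missing or over its limit."""
    failures = []
    for name, limit in thresholds.items():
        value = measured.get(name)
        if value is None:
            failures.append(f"{name}: not measured")
        elif value >= limit:
            failures.append(f"{name}: {value} >= {limit}")
    return failures

# Illustrative values; a real run would take these from the stress test
measured = {
    "p95_moderate_ms": 62.0,
    "p95_peak_ms": 340.0,
    "error_rate_peak": 0.0,
    "heavy_user_breach_rate": 0.02,
}

failures = evaluate_gate(measured, THRESHOLDS)
for failure in failures:
    print("FAIL", failure)
exit_code = 1 if failures else 0
print(f"exit code: {exit_code}")
# In a CI job, end the script with sys.exit(exit_code) so the step fails on breach
```

Treating a missing metric as a failure keeps a silently skipped test from passing the gate.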

The Separation That Changes Everything

There is a useful mental model for thinking about what synthetic databases give you in infrastructure testing.

Your ML system has two failure modes. The first is model failure: the predictions are wrong. The second is system failure: the infrastructure around the model cannot handle production reality.

Offline evaluation catches the first. Synthetic stress testing catches the second.

Both failures look the same to the end user. A recommendation engine that serves wrong predictions and one that times out before it can serve any predictions both produce the same outcome: a broken user experience. But they have completely different root causes, completely different fixes, and completely different detection strategies.

Waiting for real traffic to reveal infrastructure failures is expensive, avoidable, and at production scale, sometimes irreversible in terms of user trust.

Generate the synthetic load. Break the system in a test environment. Fix it before it matters.

The Bottom Line

Your model is not your ML system. Your model is one component of a larger system that includes data retrieval, feature computation, serialization, connection pooling, and serving infrastructure. Any one of those components can fail under production load even when the model itself is perfectly correct.

Synthetic databases give you the ability to simulate production scale, inject realistic cardinality distributions including power users, and measure infrastructure performance before a single real user sends a request.

Test the model in notebooks. Test the system on synthetic production scale. Ship to production only after both pass.

Everything else is hoping that real users have the same patience as your staging environment.


Before Real Users Break Your ML System, Let Synthetic Data Do It First was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
