
Your model passed staging. It broke in production. The test data was the problem all along.
Three months into my first production ML deployment, I watched a credit risk model collapse.
It had sailed through staging. Accuracy was 91%. Precision looked clean. The team celebrated. We deployed on a Friday afternoon, as you do.
By Monday morning, the model was rejecting 34% of legitimate loan applications. The incident report revealed the root cause: our test database had no customers with accounts older than 18 months. Every synthetic record we had seeded was freshly created. In production, 40% of applicants had account histories stretching back 5 to 10 years, and the temporal features our model depended on (account age at transaction time, months since last default, days between application and last credit inquiry) produced completely different distributions from anything the model had seen.
The model was not broken. The test database was. And nobody had noticed because the data looked fine on the surface: right number of rows, correct column types, plausible values. What was broken was invisible: the relationships between tables and the temporal logic that tied them together.
This article is about fixing that. Specifically, it is about schema-aware synthetic data generation: the practice of generating test databases that respect foreign key dependencies, maintain referential integrity, and enforce temporal consistency across tables. It is harder than dumping a CSV. It is also the only way to build test data you can actually trust.
Why Most Test Databases Fail ML Pipelines
When data engineers talk about test data quality, they usually mean column-level quality: no nulls where there shouldn’t be, correct data types, realistic value ranges. These are necessary conditions, but they are not sufficient for ML pipelines.
ML models learn from relationships, not just values. A fraud detection model doesn’t just learn that transaction amounts follow a log-normal distribution. It learns that high-value transactions from accounts less than 30 days old, originating from unfamiliar device fingerprints, have an elevated fraud probability. That signal lives in the join between the transactions table, the accounts table, and the device_sessions table.
If your test database generates those three tables independently without enforcing their relationships, the join produces meaningless combinations. The model trains on noise and you don’t find out until production.
The three most common failures I have seen in production ML pipelines caused by test database problems are:
- Orphaned foreign keys: Orders that reference customer IDs that don’t exist. Joins produce nulls. Features become NaN. Models silently fail or throw exceptions at inference time.
- Cardinality drift: In production, each customer has an average of 8.3 transactions. In the test database, each customer has exactly 5 because the loop ran 5 iterations. The model learns a world that doesn’t exist.
- Temporal violations: Transactions dated before the account was created. Credit inquiries after the loan was closed. These produce negative time deltas that should be impossible, and your model learns from them anyway.
None of these failures are visible without explicitly checking for them. The data looks fine in a df.head(). It looks fine in a histogram. It only breaks when you run a join.
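To make the first failure mode concrete, here is a minimal sketch (toy tables with hypothetical IDs, not the banking schema used later) of what an orphaned foreign key does to a join:

```python
import pandas as pd

# Two customers exist; one order references a customer that does not
customers = pd.DataFrame({'customer_id': ['C1', 'C2'],
                          'segment': ['retail', 'sme']})
orders = pd.DataFrame({'order_id': ['O1', 'O2', 'O3'],
                       'customer_id': ['C1', 'C2', 'C9']})  # C9 is orphaned

joined = orders.merge(customers, on='customer_id', how='left')
print(joined)  # the C9 row carries NaN in 'segment'

# Any feature derived from the join silently inherits the NaN
n_missing = joined['segment'].isna().sum()
print(f"Rows with missing features: {n_missing}")
```

Nothing in this snippet raises an error; the NaN just flows downstream, which is exactly why these failures need explicit checks.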
Schema-Aware Generation: The Core Idea
Schema-aware generation means that before you generate a single row of data, you analyse the dependency graph of your schema and generate tables in the correct order.
Every relational schema is a directed acyclic graph (DAG) where:
- Nodes are tables
- Edges represent foreign key relationships (child table depends on parent table)
You cannot generate an orders row before the customers row it references exists. You cannot generate an invoices row before the orders row it references exists. If you generate tables in the wrong order, referential integrity breaks.
The correct generation order is the topological sort of this dependency graph.
Building a Schema Dependency Resolver
Here is a Python implementation of a topological sort for table generation order. This is the foundation of every schema-aware generator.
```python
from collections import defaultdict, deque

def topological_sort_tables(schema_dependencies):
    """
    Sort tables in correct generation order based on foreign key dependencies.
    Tables with no dependencies are generated first.

    Args:
        schema_dependencies: dict mapping each table to its list of parent tables
    Returns:
        List of table names in safe generation order
    Raises:
        ValueError if a circular dependency is detected
    """
    in_degree = defaultdict(int)
    graph = defaultdict(list)
    for table, deps in schema_dependencies.items():
        for dep in deps:
            graph[dep].append(table)
            in_degree[table] += 1
        if table not in in_degree:
            in_degree[table] = 0

    queue = deque([t for t in in_degree if in_degree[t] == 0])
    generation_order = []
    while queue:
        table = queue.popleft()
        generation_order.append(table)
        for dependent in graph[table]:
            in_degree[dependent] -= 1
            if in_degree[dependent] == 0:
                queue.append(dependent)

    # Detect circular dependencies
    if len(generation_order) != len(schema_dependencies):
        generated = set(generation_order)
        remaining = set(schema_dependencies.keys()) - generated
        raise ValueError(f"Circular dependency detected among tables: {remaining}")
    return generation_order

# Example: Banking schema with four related tables
schema = {
    'customers': [],
    'accounts': ['customers'],
    'transactions': ['accounts', 'customers'],
    'invoices': ['transactions', 'accounts']
}

order = topological_sort_tables(schema)
print(f"Generation order: {order}")
# Output: Generation order: ['customers', 'accounts', 'transactions', 'invoices']
```
Run this before any generation step and you will never insert a child row before its parent exists.
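If you'd rather not maintain the sort yourself, the Python standard library ships the same algorithm: `graphlib.TopologicalSorter` (Python 3.9+) produces an equivalent ordering and raises `CycleError` where the function above raises `ValueError`. A brief sketch with the same schema:

```python
from graphlib import TopologicalSorter, CycleError

# Same banking schema, expressed as table -> set of parent tables
schema = {
    'customers': set(),
    'accounts': {'customers'},
    'transactions': {'accounts', 'customers'},
    'invoices': {'transactions', 'accounts'},
}
order = list(TopologicalSorter(schema).static_order())
print(f"Generation order: {order}")

# A cyclic schema is rejected before any row is generated
try:
    list(TopologicalSorter({'payments': {'invoices'},
                            'invoices': {'payments'}}).static_order())
except CycleError:
    print("Circular dependency detected")
```

The hand-rolled version is still worth understanding, but in production code the stdlib implementation is one less thing to test.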
Generating Tables with Enforced Referential Integrity
Once you have the generation order, you need to pass valid parent IDs down into child table generators. Here is a complete implementation for a three-table banking schema.
```python
import pandas as pd
import numpy as np
from faker import Faker
from datetime import timedelta, datetime

fake = Faker('en_IN')
np.random.seed(42)

def generate_customers(n=1000):
    """Generate the root table. No foreign key dependencies."""
    ids = [f"CUST{str(i).zfill(6)}" for i in range(1, n + 1)]
    # Registration dates spread over six years to ensure realistic account age
    start = datetime(2020, 1, 1)
    end = datetime(2026, 1, 1)
    date_range = (end - start).days
    reg_dates = [
        start + timedelta(days=int(np.random.randint(0, date_range)))
        for _ in range(n)
    ]
    return pd.DataFrame({
        'customer_id': ids,
        'name': [fake.name() for _ in range(n)],
        'email': [fake.email() for _ in range(n)],
        'registration_date': reg_dates,
        'segment': np.random.choice(
            ['retail', 'sme', 'enterprise'],
            size=n,
            p=[0.6, 0.3, 0.1]
        )
    })

def generate_accounts(customers_df, avg_accounts_per_customer=1.8):
    """
    Generate accounts linked to customers.
    Enforces: account creation date >= customer registration date
    """
    rows = []
    account_counter = 1
    for _, customer in customers_df.iterrows():
        # Realistic cardinality: each customer typically has 1-4 accounts
        n_accounts = max(1, int(np.random.poisson(avg_accounts_per_customer)))
        for _ in range(n_accounts):
            reg_date = customer['registration_date']
            days_since_reg = (datetime(2026, 1, 1) - reg_date).days
            # Account opened between registration and today (temporal constraint)
            account_open_date = reg_date + timedelta(
                days=int(np.random.randint(0, max(1, days_since_reg)))
            )
            rows.append({
                'account_id': f"ACC{str(account_counter).zfill(8)}",
                'customer_id': customer['customer_id'],  # Valid FK from parent
                'account_type': np.random.choice(
                    ['savings', 'current', 'loan'],
                    p=[0.5, 0.35, 0.15]
                ),
                'opened_date': account_open_date,
                'balance': round(np.random.lognormal(10, 1.2), 2)
            })
            account_counter += 1
    return pd.DataFrame(rows)

def generate_transactions(accounts_df, transactions_per_account=15):
    """
    Generate transactions linked to accounts.
    Enforces: transaction date >= account opening date
    """
    rows = []
    txn_counter = 1
    for _, account in accounts_df.iterrows():
        n_transactions = max(1, int(np.random.poisson(transactions_per_account)))
        for _ in range(n_transactions):
            open_date = account['opened_date']
            days_active = (datetime(2026, 3, 1) - open_date).days
            # Transaction must occur after account was opened (temporal constraint)
            txn_date = open_date + timedelta(
                days=int(np.random.randint(0, max(1, days_active)))
            )
            rows.append({
                'transaction_id': f"TXN{str(txn_counter).zfill(10)}",
                'account_id': account['account_id'],    # Valid FK from parent
                'customer_id': account['customer_id'],  # Denormalized FK for query performance
                'transaction_date': txn_date,
                'amount': round(np.random.lognormal(6, 1.5), 2),
                'transaction_type': np.random.choice(
                    ['credit', 'debit'],
                    p=[0.45, 0.55]
                ),
                # Inject rare fraud events at realistic 0.8% frequency
                'is_fraud': int(np.random.random() < 0.008)
            })
            txn_counter += 1
    return pd.DataFrame(rows)

# Generate in topological order
print("Generating customers...")
customers_df = generate_customers(1000)
print("Generating accounts...")
accounts_df = generate_accounts(customers_df)
print("Generating transactions...")
transactions_df = generate_transactions(accounts_df)

print("\nGenerated:")
print(f"  Customers: {len(customers_df):,}")
print(f"  Accounts: {len(accounts_df):,}")
print(f"  Transactions: {len(transactions_df):,}")
```
Output:
```text
Generating customers...
Generating accounts...
Generating transactions...

Generated:
  Customers: 1,000
  Accounts: 1,803
  Transactions: 27,045
```
Validating the Generated Database
Generation is only half the job. You must validate before you trust. Here is a complete validation suite that catches the three failure modes described earlier.
```python
def validate_schema_integrity(customers_df, accounts_df, transactions_df):
    """
    Run referential integrity, cardinality, and temporal consistency checks.
    All checks must pass before the database is approved for ML use.
    """
    failures = []
    print("=" * 65)
    print("SCHEMA INTEGRITY VALIDATION REPORT")
    print("=" * 65)

    # Check 1: Referential integrity (accounts -> customers)
    orphaned_accounts = accounts_df[
        ~accounts_df['customer_id'].isin(customers_df['customer_id'])
    ]
    status = "PASS" if len(orphaned_accounts) == 0 else "FAIL"
    if len(orphaned_accounts) > 0:
        failures.append("Orphaned accounts")
    print(f"Orphaned accounts:               {len(orphaned_accounts):>6} {status}")

    # Check 2: Referential integrity (transactions -> accounts)
    orphaned_txns = transactions_df[
        ~transactions_df['account_id'].isin(accounts_df['account_id'])
    ]
    status = "PASS" if len(orphaned_txns) == 0 else "FAIL"
    if len(orphaned_txns) > 0:
        failures.append("Orphaned transactions")
    print(f"Orphaned transactions:           {len(orphaned_txns):>6} {status}")

    # Check 3: Temporal consistency (account opened after customer registration)
    merged_acc = accounts_df.merge(
        customers_df[['customer_id', 'registration_date']],
        on='customer_id',
        how='left'
    )
    temporal_violations_acc = merged_acc[
        merged_acc['opened_date'] < merged_acc['registration_date']
    ]
    status = "PASS" if len(temporal_violations_acc) == 0 else "FAIL"
    if len(temporal_violations_acc) > 0:
        failures.append("Account temporal violations")
    print(f"Account temporal violations:     {len(temporal_violations_acc):>6} {status}")

    # Check 4: Temporal consistency (transaction after account opening)
    merged_txn = transactions_df.merge(
        accounts_df[['account_id', 'opened_date']],
        on='account_id',
        how='left'
    )
    temporal_violations_txn = merged_txn[
        merged_txn['transaction_date'] < merged_txn['opened_date']
    ]
    status = "PASS" if len(temporal_violations_txn) == 0 else "FAIL"
    if len(temporal_violations_txn) > 0:
        failures.append("Transaction temporal violations")
    print(f"Transaction temporal violations: {len(temporal_violations_txn):>6} {status}")

    # Check 5: Cardinality realism (avg transactions per account)
    avg_txn_per_account = len(transactions_df) / len(accounts_df)
    cardinality_ok = 10 <= avg_txn_per_account <= 25
    status = "PASS" if cardinality_ok else "FAIL"
    if not cardinality_ok:
        failures.append("Cardinality drift")
    print(f"Avg txns per account:            {avg_txn_per_account:>6.1f} {status}")

    # Check 6: Rare event representation (fraud rate between 0.5% and 2%)
    fraud_rate = transactions_df['is_fraud'].mean()
    fraud_ok = 0.005 <= fraud_rate <= 0.02
    status = "PASS" if fraud_ok else "FAIL"
    if not fraud_ok:
        failures.append("Fraud rate out of bounds")
    print(f"Fraud rate:                      {fraud_rate:>6.2%} {status}")

    print("=" * 65)
    if not failures:
        print("ALL CHECKS PASSED. Database approved for ML use.")
    else:
        print(f"FAILED CHECKS: {', '.join(failures)}")
        print("Do not use for ML training until all checks pass.")
    print("=" * 65)
    return len(failures) == 0

# Run validation
approved = validate_schema_integrity(customers_df, accounts_df, transactions_df)
```
Output:
```text
=================================================================
SCHEMA INTEGRITY VALIDATION REPORT
=================================================================
Orphaned accounts:                    0 PASS
Orphaned transactions:                0 PASS
Account temporal violations:          0 PASS
Transaction temporal violations:      0 PASS
Avg txns per account:              15.0 PASS
Fraud rate:                       0.80% PASS
=================================================================
ALL CHECKS PASSED. Database approved for ML use.
=================================================================
```
Export to SQLite for Pipeline Testing
```python
import sqlite3

conn = sqlite3.connect('schema_aware_test.sqlite')
customers_df.to_sql('customers', conn, index=False, if_exists='replace')
accounts_df.to_sql('accounts', conn, index=False, if_exists='replace')
transactions_df.to_sql('transactions', conn, index=False, if_exists='replace')

# Test a complex join that mirrors real ML feature engineering
feature_query = """
SELECT
    c.customer_id,
    c.segment,
    COUNT(DISTINCT a.account_id) AS num_accounts,
    COUNT(t.transaction_id) AS total_transactions,
    AVG(t.amount) AS avg_transaction_amount,
    SUM(t.is_fraud) AS fraud_count,
    CAST(julianday('now') - julianday(MIN(a.opened_date)) AS INTEGER) AS account_age_days
FROM customers c
JOIN accounts a ON c.customer_id = a.customer_id
JOIN transactions t ON a.account_id = t.account_id
GROUP BY c.customer_id, c.segment
LIMIT 5
"""

result = pd.read_sql_query(feature_query, conn)
print("\nSample Feature Engineering Output:")
print(result.to_string(index=False))
conn.close()
```
Output:
```text
Sample Feature Engineering Output:
customer_id    segment  num_accounts  total_transactions  avg_transaction_amount  fraud_count  account_age_days
 CUST000001     retail             2                  28                  892.45            0              1285
 CUST000002        sme             3                  41                 2341.20            1              1456
 CUST000003     retail             1                  14                  445.80            0               892
 CUST000004 enterprise             4                  67                 8920.50            0              2103
 CUST000005     retail             2                  22                  673.10            0               645
```
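Once the data lives in SQLite, the orphan check can also be pushed down to SQL with an anti-join instead of pandas `isin`. A minimal self-contained sketch (toy in-memory tables, not the generated database above):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE accounts (account_id TEXT PRIMARY KEY);
    CREATE TABLE transactions (transaction_id TEXT, account_id TEXT);
    INSERT INTO accounts VALUES ('ACC1'), ('ACC2');
    INSERT INTO transactions VALUES ('T1', 'ACC1'), ('T2', 'ACC9');
""")

# Anti-join: transactions whose account_id has no match in accounts
orphans = conn.execute("""
    SELECT COUNT(*) FROM transactions t
    LEFT JOIN accounts a ON t.account_id = a.account_id
    WHERE a.account_id IS NULL
""").fetchone()[0]
print(f"Orphaned transactions: {orphans}")
conn.close()
```

Running the check in the same engine your pipeline queries is a useful sanity test in its own right: it exercises the join path the feature code will actually use.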
Where This Approach Breaks
This manual approach covers three tables comfortably. In my experience, it starts breaking down when:
- Schema has 10+ tables: Manually managing FK mapping across a dozen tables becomes fragile. Any schema change requires rewriting multiple generators.
- Circular or conditional relationships: Some schemas have soft references (e.g., a users table with a referred_by field that references itself). Topological sort fails on cycles.
- High-volume requirements: This loop-based approach generates ~27k rows in a few seconds. At 10 million rows, it becomes too slow. Vectorized generation or bulk COPY operations are required.
- Schema changes in CI/CD: Every migration requires manually updating generator functions. This creates maintenance debt that compounds quickly.
These are not theoretical edge cases. They are the everyday reality of data engineering at scale. The solution is either investing in a more sophisticated generation framework or shifting toward configuration-driven generation. Either way, the validation suite above stays relevant regardless of how you generate the data.
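For the high-volume case specifically, the per-row loops above can usually be replaced by a vectorized pass. A rough sketch of the same parent-to-child expansion in NumPy, keeping the Poisson cardinality model (an outline of the approach, not a drop-in replacement for the generators above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_customers = 1_000_000

# One Poisson draw per customer, clipped so every customer gets >= 1 account
counts = np.maximum(1, rng.poisson(1.8, size=n_customers))

# Repeat each parent ID by its child count: no Python-level loop,
# and referential integrity holds by construction
parent_ids = np.repeat(np.arange(n_customers), counts)
accounts = pd.DataFrame({
    'account_id': np.arange(parent_ids.size),
    'customer_id': parent_ids,
})
print(f"{len(accounts):,} accounts for {n_customers:,} customers")
```

The same repeat trick extends to temporal constraints: draw per-child day offsets as a vector and add them to the repeated parent dates, so the child-after-parent invariant still holds row by row.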
The Complete Checklist
Before using any synthetically generated database for ML training or pipeline testing:
- Tables generated in topological dependency order
- Zero orphaned foreign keys in any child table
- All temporal constraints satisfied (child event date >= parent event date)
- Cardinality matches production (average rows per parent within 20% of target)
- Rare events injected at realistic frequencies (fraud, defaults, anomalies)
- Complex join queries produce expected feature distributions
- Validation report stored alongside model card for audit purposes
The Bottom Line
The failure that broke my production model wasn’t a model failure. It was a test data failure that looked like a model failure. That distinction cost three days of debugging and a weekend of incident response.
Schema-aware synthetic data generation is not a nice-to-have. For any ML pipeline that depends on relational data, it is the minimum standard for test data that can actually be trusted.
Generate in topological order. Enforce temporal constraints. Validate before you train.
Anything less is building on a foundation that lies.
Why Your Test Database Is Lying to Your ML Model: A Deep Dive into Schema-Aware Data Generation was originally published in Towards AI on Medium.