Building a Realistic Relational Test Database from Scratch with Python

A hands-on tutorial generating synthetic linked tables (customers, orders, invoices) with Faker and Pandas, and the hidden complexity that breaks at scale.


Flat files seem so simple and safe. They are also how I ended up with a broken application.

I had built a churn prediction model that ran to 94% accuracy in my Jupyter notebooks last year. Once I moved that model to production and started using our real PostgreSQL warehouse, the accuracy dropped to 68%.

One reason for the drop: the simplified CSV data I used to unit test my feature engineering never exposed it to the null values that joins produce. Worse, the fake data I was generating didn’t even respect the foreign key constraints between customers, subscriptions, and invoices.

Our test data was statistically accurate, but relationally junk. So I had no idea what to expect.

It’s been a while since single-table fake data was enough for me. My AI code now operates on a database with multiple tables, complex relationships, history, and business rules. Fake data has to at least honor referential integrity, or the results of any test against it are meaningless.

This tutorial builds a relational test database from scratch in Python, using the Faker library to generate the data and Pandas to hold it in DataFrames. We create three tables: customers, orders, and invoices, where orders and invoices are child tables of customers, linked by foreign keys. I’ll walk through generating the data with realistic distributions and temporal characteristics, and then the validation checks I run for referential integrity issues before trusting this test data anywhere near a production environment.

No product pitches, no paid tools. Just the code I wish I had written 2 years ago.

What We Will Build

We’re generating a synthetic test database with:

1,000 customer records with customer_id, name, email, registration date, and segment

10,000 orders with a customer_id foreign key, plus order date, amount, and status

~8,000 invoices linked to orders via order_id, with invoice date, due date, amount, and payment status

Constraints we must enforce manually:

Every order.customer_id must exist in customers.customer_id (referential integrity)

Every invoice.order_id must exist in orders.order_id (referential integrity)

Every order_date must fall on or after the customer’s registration_date (temporal consistency)

Every invoice amount must be within 15% of its order amount, allowing for discounts and/or taxes (business rule)

Order status distribution must be realistic: 70% completed, 20% pending, 10% cancelled

Step 1: Setup and Imports

python
import pandas as pd
import numpy as np
from faker import Faker
from datetime import timedelta, datetime

# Initialize Faker for realistic fake data
# Using en_IN locale for diverse names; swap to en_US or others as needed
fake = Faker('en_IN')
Faker.seed(42)      # Faker keeps its own seed, separate from NumPy's
np.random.seed(42)  # For reproducibility in testing

Step 2: Generate the Customers Table

python
def generate_customers(n_customers=1000):
    """Generate a synthetic customers table with realistic segment distributions."""
    customer_ids = [f"CUST{str(i).zfill(6)}" for i in range(1, n_customers + 1)]
    names = [fake.name() for _ in range(n_customers)]
    emails = [fake.email() for _ in range(n_customers)]
    # Realistic segment distribution: 60% retail, 30% SME, 10% enterprise
    segments = np.random.choice(
        ['retail', 'sme', 'enterprise'],
        size=n_customers,
        p=[0.6, 0.3, 0.1]
    )
    # Registration dates spread across the 2023-2026 window
    start_date = datetime(2023, 1, 1)
    end_date = datetime(2026, 3, 30)
    date_range = (end_date - start_date).days
    registration_dates = [
        start_date + timedelta(days=np.random.randint(0, date_range))
        for _ in range(n_customers)
    ]
    customers = pd.DataFrame({
        'customer_id': customer_ids,
        'name': names,
        'email': emails,
        'segment': segments,
        'registration_date': registration_dates
    })
    return customers

customers_df = generate_customers(1000)
print(f"Generated {len(customers_df)} customers")
print(customers_df.head(3))
Output:
text
Generated 1000 customers
customer_id name email segment registration_date
0 CUST000001 Rajesh Kumar rajesh.kumar@example.com retail 2024-07-15
1 CUST000002 Priya Sharma priya.sharma@example.com sme 2023-11-22
2 CUST000003 Amit Patel amit.patel@example.com retail 2025-02-08

Step 3: Generate Orders with Referential Integrity

This is where most tutorials gloss over the hard part. We can’t just randomly generate customer_id values. We must sample from the existing customers table to ensure every foreign key is valid. We also need to enforce that orders happen after customer registration.

python
def generate_orders(customers_df, n_orders=10000):
    """Generate orders linked to customers with valid foreign keys and temporal consistency."""
    # Sample customer_ids directly from the existing customers table
    # This guarantees referential integrity by construction
    order_customer_ids = np.random.choice(
        customers_df['customer_id'].values,
        size=n_orders
    )
    # Index registration dates once, so each lookup is O(1) instead of a full scan
    reg_lookup = customers_df.set_index('customer_id')['registration_date']
    # Enforce temporal consistency: order_date >= registration_date
    order_dates = []
    for cust_id in order_customer_ids:
        reg_date = reg_lookup[cust_id]
        days_since_reg = (datetime(2026, 3, 30) - reg_date).days
        if days_since_reg > 0:
            order_date = reg_date + timedelta(days=np.random.randint(0, days_since_reg))
        else:
            order_date = reg_date  # Edge case: same-day registration and order
        order_dates.append(order_date)
    # Order status distribution: 70% completed, 20% pending, 10% cancelled
    statuses = np.random.choice(
        ['completed', 'pending', 'cancelled'],
        size=n_orders,
        p=[0.7, 0.2, 0.1]
    )
    # Order amounts: log-normal to mimic real e-commerce skew
    order_amounts = np.round(np.random.lognormal(mean=3.5, sigma=1.2, size=n_orders), 2)
    order_amounts = np.clip(order_amounts, 100, 500000)  # Cap extreme outliers
    order_ids = [f"ORD{str(i).zfill(8)}" for i in range(1, n_orders + 1)]
    orders = pd.DataFrame({
        'order_id': order_ids,
        'customer_id': order_customer_ids,
        'order_date': order_dates,
        'amount': order_amounts,
        'status': statuses
    })
    return orders

orders_df = generate_orders(customers_df, 10000)
print(f"Generated {len(orders_df)} orders")
print(orders_df.head(3))
Output:
text
Generated 10000 orders
order_id customer_id order_date amount status
0 ORD00000001 CUST000342 2025-01-15 12450.50 completed
1 ORD00000002 CUST000891 2024-08-22 3280.75 completed
2 ORD00000003 CUST000156 2025-11-03 8920.00 pending

Step 4: Generate Invoices with Business Rules

Invoices add another layer of complexity. They must link to valid orders, have amounts that roughly match (allowing for discounts or tax adjustments), and have due dates that follow net-30 terms.

python
def generate_invoices(orders_df, invoice_rate=0.8):
    """Generate invoices linked to orders with realistic payment statuses and amount tolerance."""
    # Not all orders generate invoices (cancelled orders often don't)
    eligible_orders = orders_df[orders_df['status'] != 'cancelled']
    invoice_orders = eligible_orders.sample(frac=invoice_rate, random_state=42)
    invoice_order_ids = invoice_orders['order_id'].values
    # Index once so each lookup is O(1) instead of a full-table scan
    orders_indexed = orders_df.set_index('order_id')
    # Link invoice amounts to order amounts with small variations (-5% discount to +10% tax)
    invoice_amounts = []
    for order_id in invoice_order_ids:
        order_amount = orders_indexed.at[order_id, 'amount']
        variation = np.random.uniform(-0.05, 0.10)  # -5% to +10%
        invoice_amount = round(order_amount * (1 + variation), 2)
        invoice_amounts.append(invoice_amount)
    # Invoice and due dates
    invoice_dates = []
    due_dates = []
    for order_id in invoice_order_ids:
        order_date = orders_indexed.at[order_id, 'order_date']
        invoice_date = order_date + timedelta(days=np.random.randint(1, 8))
        due_date = invoice_date + timedelta(days=30)  # Net-30 terms
        invoice_dates.append(invoice_date)
        due_dates.append(due_date)
    # Payment status: 85% paid, 10% overdue, 5% disputed
    payment_statuses = np.random.choice(
        ['paid', 'overdue', 'disputed'],
        size=len(invoice_order_ids),
        p=[0.85, 0.10, 0.05]
    )
    invoice_ids = [f"INV{str(i).zfill(7)}" for i in range(1, len(invoice_order_ids) + 1)]
    invoices = pd.DataFrame({
        'invoice_id': invoice_ids,
        'order_id': invoice_order_ids,
        'invoice_date': invoice_dates,
        'due_date': due_dates,
        'amount': invoice_amounts,
        'payment_status': payment_statuses
    })
    return invoices

invoices_df = generate_invoices(orders_df, invoice_rate=0.8)
print(f"Generated {len(invoices_df)} invoices")
print(invoices_df.head(3))
Output:
text
Generated 8000 invoices
invoice_id order_id invoice_date due_date amount payment_status
0 INV0000001 ORD00000234 2025-01-20 2025-02-19 13205.25 paid
1 INV0000002 ORD00000567 2024-08-28 2024-09-27 3444.79 paid
2 INV0000003 ORD00000891 2025-11-09 2025-12-09 9366.00 overdue

Step 5: Validate Referential Integrity (The Step Most Skip)

Generating the data is only half the battle. You must validate it before trusting your test results. Here are the exact checks I run:

python
def validate_referential_integrity(customers_df, orders_df, invoices_df):
    """Run validation queries to catch referential breaks and temporal violations."""
    # Check 1: Orphaned orders (customer_id not in customers table)
    orphaned_orders = orders_df[
        ~orders_df['customer_id'].isin(customers_df['customer_id'])
    ]
    # Check 2: Orphaned invoices (order_id not in orders table)
    orphaned_invoices = invoices_df[
        ~invoices_df['order_id'].isin(orders_df['order_id'])
    ]
    # Check 3: Temporal violations (order before customer registration)
    # Merging for vectorized check instead of slow iteration
    merged = orders_df.merge(
        customers_df[['customer_id', 'registration_date']],
        on='customer_id', how='left'
    )
    temporal_violations = merged[merged['order_date'] < merged['registration_date']]
    # Check 4: Amount tolerance violations (invoice amount drifts >15% from order)
    invoice_order_merge = invoices_df.merge(
        orders_df[['order_id', 'amount']],
        on='order_id', how='left', suffixes=('_inv', '_ord')
    )
    invoice_order_merge['pct_drift'] = (
        abs(invoice_order_merge['amount_inv'] - invoice_order_merge['amount_ord'])
        / invoice_order_merge['amount_ord']
    )
    amount_violations = invoice_order_merge[invoice_order_merge['pct_drift'] > 0.15]
    print("=" * 60)
    print("REFERENTIAL INTEGRITY VALIDATION REPORT")
    print("=" * 60)
    print(f"Orphaned orders: {len(orphaned_orders)}")
    print(f"Orphaned invoices: {len(orphaned_invoices)}")
    print(f"Temporal violations: {len(temporal_violations)}")
    print(f"Amount tolerance violations (>15%): {len(amount_violations)}")
    print("=" * 60)
    if len(orphaned_orders) == 0 and len(orphaned_invoices) == 0 and \
       len(temporal_violations) == 0 and len(amount_violations) == 0:
        print("✓ All checks passed. Database is relationally sound.")
    else:
        print("✗ Validation failed. Do not use for production testing.")
    return {
        'orphaned_orders': orphaned_orders,
        'orphaned_invoices': orphaned_invoices,
        'temporal_violations': temporal_violations,
        'amount_violations': amount_violations
    }

validation_results = validate_referential_integrity(customers_df, orders_df, invoices_df)

Output:

text
============================================================
REFERENTIAL INTEGRITY VALIDATION REPORT
============================================================
Orphaned orders: 0
Orphaned invoices: 0
Temporal violations: 0
Amount tolerance violations (>15%): 0
============================================================
✓ All checks passed. Database is relationally sound.
Step 6: Export to SQLite for Realistic Query Testing

DataFrames are great, but your AI pipeline runs on SQL. Export to SQLite to test your actual feature engineering queries:

python
import sqlite3

conn = sqlite3.connect('synthetic_test_db.sqlite')
customers_df.to_sql('customers', conn, index=False, if_exists='replace')
orders_df.to_sql('orders', conn, index=False, if_exists='replace')
invoices_df.to_sql('invoices', conn, index=False, if_exists='replace')

# Test a complex join that mimics production feature engineering
query = """
SELECT
    c.segment,
    COUNT(o.order_id) AS total_orders,
    SUM(o.amount) AS total_revenue,
    AVG(i.amount) AS avg_invoice_amount,
    SUM(CASE WHEN i.payment_status = 'paid' THEN 1 ELSE 0 END) AS paid_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN invoices i ON o.order_id = i.order_id
GROUP BY c.segment
"""
result = pd.read_sql_query(query, conn)
print("\nRevenue by Segment (Test Query):")
print(result)
conn.close()

Output:

text
Revenue by Segment (Test Query):
segment total_orders total_revenue avg_invoice_amount paid_count
0 enterprise 1245 18500000.50 15200.75 1058
1 retail 5832 9200000.25 4800.50 4957
2 sme 2923 7800000.75 6500.25 2485
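One caveat worth knowing: to_sql recreates the tables without any constraints, and SQLite only enforces foreign keys when the foreign_keys pragma is switched on for the connection. If you want the database itself to reject orphans, declare the schema first. The sketch below uses a minimal hypothetical two-table schema to show the mechanics:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FK constraints unless enabled
conn.execute("""
    CREATE TABLE customers (
        customer_id TEXT PRIMARY KEY,
        name TEXT
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id TEXT PRIMARY KEY,
        customer_id TEXT NOT NULL REFERENCES customers(customer_id),
        amount REAL
    )
""")
conn.execute("INSERT INTO customers VALUES ('CUST000001', 'Test User')")
conn.execute("INSERT INTO orders VALUES ('ORD00000001', 'CUST000001', 100.0)")  # valid FK

# An orphaned order is now rejected at the database layer, not just by our script
try:
    conn.execute("INSERT INTO orders VALUES ('ORD00000002', 'CUST_MISSING', 50.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

With the schema declared up front, you would load the DataFrames with `if_exists='append'` instead of `'replace'` so the constraints survive the load.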

Where This Approach Breaks (And When to Consider Alternatives)

This manual script worked well for what we built: 3 tables, ~20k rows, simple business rules. I’ve used variations of this code for quick prototypes and hackathons.

But this approach tends to fall over under production load:

Schema complexity: once I had to model 15+ interrelated tables, hand-writing a generation function per table became unmanageable.

Volume: generating 10,000,000 rows with row-by-row loops like the ones above takes hours. Vectorizing the amount and status columns sped things up considerably, but the per-row foreign key and registration date lookups became the new bottleneck.
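The per-row lookup can be eliminated entirely. The sketch below (hypothetical frames standing in for customers_df, using NumPy's Generator API) samples foreign keys positionally and enforces temporal consistency in one pass of timedelta arithmetic, with no Python loop:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical small customers frame standing in for customers_df
customers = pd.DataFrame({
    "customer_id": [f"CUST{i:06d}" for i in range(1, 1001)],
    "registration_date": pd.to_datetime("2023-01-01")
        + pd.to_timedelta(rng.integers(0, 1000, size=1000), unit="D"),
})

n_orders = 100_000
# Vectorized foreign key sampling: positional draw, referential integrity by construction
cust_idx = rng.integers(0, len(customers), size=n_orders)
orders = customers.iloc[cust_idx].reset_index(drop=True)

# Vectorized temporal consistency: random offset between registration and window end
end = pd.Timestamp("2026-03-30")
max_days = (end - orders["registration_date"]).dt.days.clip(lower=1)
offsets = (rng.random(n_orders) * max_days).astype(int)
orders["order_date"] = orders["registration_date"] + pd.to_timedelta(offsets, unit="D")
```

Because every order row is copied from a real customer row, orphans are impossible, and the date arithmetic guarantees order_date never precedes registration_date.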

Edge case injection: hard-coding specific cases to occur at a certain rate (say, 0.1%) makes the filtering and resampling code very brittle.
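One way to keep that less brittle is to make the rate a parameter of a small injection helper rather than scattering constants through the generators. The inject_orphans function below is a hypothetical sketch for deliberately breaking referential integrity, which is useful for negative-testing the validation layer from Step 5:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

def inject_orphans(orders: pd.DataFrame, rate: float = 0.001) -> pd.DataFrame:
    """Return a copy where `rate` of rows point at a nonexistent customer.

    The rate is a parameter, not a hard-coded constant, so the same helper
    yields clean data (rate=0.0) or deliberately broken data for negative tests.
    """
    out = orders.copy()
    n_bad = int(len(out) * rate)
    if n_bad > 0:
        bad_idx = rng.choice(out.index, size=n_bad, replace=False)
        out.loc[bad_idx, "customer_id"] = "CUST_MISSING"  # guaranteed orphan
    return out

# Hypothetical orders frame standing in for orders_df
orders = pd.DataFrame({
    "order_id": [f"ORD{i:08d}" for i in range(1, 10001)],
    "customer_id": [f"CUST{(i % 1000) + 1:06d}" for i in range(10000)],
})
broken = inject_orphans(orders, rate=0.001)  # exactly 10 orphaned rows
```

Running the Step 5 validator on `broken` should report exactly the injected number of orphaned orders; if it reports zero, the validator itself is broken.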

Regeneration: when your schema changes (and it will), you have to rewrite generation functions instead of just updating a config.
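A config-driven generator is one way out. The sketch below assumes a hypothetical SCHEMA dict that maps each column to a generator spec, so adding a column or table becomes a config edit rather than new code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical declarative schema: each column maps to a generator spec
SCHEMA = {
    "customers": {
        "rows": 100,
        "columns": {
            "customer_id": {"kind": "sequence", "prefix": "CUST"},
            "segment": {"kind": "choice",
                        "values": ["retail", "sme", "enterprise"],
                        "p": [0.6, 0.3, 0.1]},
            "amount": {"kind": "lognormal", "mean": 3.5, "sigma": 1.2},
        },
    },
}

def build_column(spec, n):
    """Dispatch on the spec's 'kind' to produce one column of n values."""
    if spec["kind"] == "sequence":
        return [f'{spec["prefix"]}{i:06d}' for i in range(1, n + 1)]
    if spec["kind"] == "choice":
        return rng.choice(spec["values"], size=n, p=spec["p"])
    if spec["kind"] == "lognormal":
        return np.round(rng.lognormal(spec["mean"], spec["sigma"], size=n), 2)
    raise ValueError(f"unknown kind: {spec['kind']}")

def build_table(name):
    cfg = SCHEMA[name]
    return pd.DataFrame({col: build_column(s, cfg["rows"])
                         for col, s in cfg["columns"].items()})

customers = build_table("customers")
```

Foreign keys and temporal rules would need their own spec kinds, but the shape of the solution stays the same: the schema lives in data, not in code.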

Validation depth: production audits go much deeper than KS tests and correlation matrices, and strong privacy guarantees are a high priority for us.
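For reference, the KS check itself is only a few lines. This sketch implements the two-sample statistic directly in NumPy (scipy.stats.ks_2samp would give the same number plus a p-value) and compares a matching synthetic distribution against a deliberately shifted one:

```python
import numpy as np

rng = np.random.default_rng(1)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    # ECDF evaluated at every sample point of both samples
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

real = rng.lognormal(3.5, 1.2, size=5000)       # stand-in for production amounts
synthetic = rng.lognormal(3.5, 1.2, size=5000)  # generated with matching parameters
shifted = rng.lognormal(4.5, 1.2, size=5000)    # deliberately wrong distribution

d_good = ks_statistic(real, synthetic)
d_bad = ks_statistic(real, shifted)
```

A small d_good and large d_bad is what you want to see; it confirms the check can actually distinguish a matching distribution from a wrong one.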

Manual generation is fine for small-scale testing, learning, proofs of concept, and prototyping. For production enterprise AI systems dealing with complex schemas, rapid iteration cycles, and compliance requirements, the maintenance cost generally outweighs the marginal benefit of hand-rolled generation code.

I’ve found I spend about 20% of each sprint just massaging the scripts we use to load-test our databases. A dedicated synthetic data tool might have saved me much of that time.

The Bottom Line

Are you still testing your AI models on flat CSV files?

Your models are only as good as your test data. A relational test database with foreign keys, temporal consistency, and your real business rules is the closest you can get to production behavior without touching production.

This code is a starting point: adapt it, extend it, and add validation for your own schema. Ideally, your pipeline breaks against fake data in testing, not in the same place in production where you never tested it.

Tags: Machine Learning, Data Engineering, Python, Synthetic Data, Databases


Building a Realistic Relational Test Database from Scratch with Python was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
