Building a Realistic Relational Test Database from Scratch with Python

A hands-on tutorial generating synthetic linked tables (customers, orders, invoices) with Faker and Pandas, and the hidden complexity that breaks at scale.


Flat files seem so simple and safe. They are also how I ended up with a broken application.

I had built a churn prediction model that ran to 94% accuracy in my Jupyter notebooks last year. Once I moved that model to production and started using our real PostgreSQL warehouse, the accuracy dropped to 68%.

One reason for the drop: the simplified CSV data I used to unit test my feature engineering never exposed it to the null values that joins produce. Worse, the fake data I was generating didn’t even respect the foreign key constraints between customers, subscriptions, and invoices.

Our test data was statistically accurate, but relationally junk. So I had no idea what to expect.

It’s been a while since single-table fake data was enough for me. My AI code now operates on a database with multiple tables, complex relationships, history, and business rules. Fake data has to at least honor referential integrity, or the results of any test against it are meaningless.

This tutorial builds a relational test database from scratch in Python, using the Faker library to generate the data and Pandas to hold it in DataFrames. We create three tables: customers, orders, and invoices, where orders and invoices are child tables of customers, linked by foreign keys. I’ll walk through generating the data with realistic distributions and temporal characteristics, and then the validation checks I run for referential integrity issues before trusting this test data anywhere near a production environment.

No product pitches, no paid tools. Just the code I wish I had written 2 years ago.

What We Will Build

We’re generating a synthetic test database with:

1,000 customer records with customer_id, name, email, registration date, and segment

10,000 orders with a customer_id foreign key, plus order date, amount, and status

~8,000 invoices linked to orders via order_id, with invoice date, due date, amount, and payment status

Constraints we must enforce manually:

Every order.customer_id must exist in customers.customer_id (referential integrity)

Every invoice.order_id must exist in orders.order_id (referential integrity)

Every order_date must fall on or after the customer’s registration_date (temporal consistency)

Every invoice amount must be within 15% of its order amount, allowing for discounts and/or taxes (business rule)

Order status distribution must be realistic: 70% completed, 20% pending, 10% cancelled

Step 1: Setup and Imports

python
import pandas as pd
import numpy as np
from faker import Faker
from datetime import timedelta, datetime

# Initialize Faker for realistic fake data
# Using en_IN locale for diverse names; swap to en_US or others as needed
fake = Faker('en_IN')
Faker.seed(42)      # Faker keeps its own seed, separate from NumPy's
np.random.seed(42)  # For reproducibility in testing

Step 2: Generate the Customers Table

python
def generate_customers(n_customers=1000):
    """Generate a synthetic customers table with realistic segment distributions."""
    customer_ids = [f"CUST{str(i).zfill(6)}" for i in range(1, n_customers + 1)]
    names = [fake.name() for _ in range(n_customers)]
    emails = [fake.email() for _ in range(n_customers)]
    # Realistic segment distribution: 60% retail, 30% SME, 10% enterprise
    segments = np.random.choice(
        ['retail', 'sme', 'enterprise'],
        size=n_customers,
        p=[0.6, 0.3, 0.1]
    )
    # Registration dates spread across the 2023-2026 window
    start_date = datetime(2023, 1, 1)
    end_date = datetime(2026, 3, 30)
    date_range = (end_date - start_date).days
    registration_dates = [
        start_date + timedelta(days=np.random.randint(0, date_range))
        for _ in range(n_customers)
    ]
    customers = pd.DataFrame({
        'customer_id': customer_ids,
        'name': names,
        'email': emails,
        'segment': segments,
        'registration_date': registration_dates
    })
    return customers

customers_df = generate_customers(1000)
print(f"Generated {len(customers_df)} customers")
print(customers_df.head(3))
Output:
text
Generated 1000 customers
customer_id name email segment registration_date
0 CUST000001 Rajesh Kumar rajesh.kumar@example.com retail 2024-07-15
1 CUST000002 Priya Sharma priya.sharma@example.com sme 2023-11-22
2 CUST000003 Amit Patel amit.patel@example.com retail 2025-02-08

Step 3: Generate Orders with Referential Integrity

This is where most tutorials gloss over the hard part. We can’t just randomly generate customer_id values. We must sample from the existing customers table to ensure every foreign key is valid. We also need to enforce that orders happen after customer registration.

python
def generate_orders(customers_df, n_orders=10000):
    """Generate orders linked to customers with valid foreign keys and temporal consistency."""
    # Sample customer_ids directly from the existing customers table
    # This guarantees referential integrity by construction
    order_customer_ids = np.random.choice(
        customers_df['customer_id'].values,
        size=n_orders
    )
    # Index registration dates once, so each lookup is O(1) instead of a full scan
    reg_lookup = customers_df.set_index('customer_id')['registration_date']
    # Enforce temporal consistency: order_date >= registration_date
    order_dates = []
    for cust_id in order_customer_ids:
        reg_date = reg_lookup[cust_id]
        days_since_reg = (datetime(2026, 3, 30) - reg_date).days
        if days_since_reg > 0:
            order_date = reg_date + timedelta(days=np.random.randint(0, days_since_reg))
        else:
            order_date = reg_date  # Edge case: same-day registration and order
        order_dates.append(order_date)
    # Order status distribution: 70% completed, 20% pending, 10% cancelled
    statuses = np.random.choice(
        ['completed', 'pending', 'cancelled'],
        size=n_orders,
        p=[0.7, 0.2, 0.1]
    )
    # Order amounts: log-normal to mimic real e-commerce skew
    order_amounts = np.round(np.random.lognormal(mean=3.5, sigma=1.2, size=n_orders), 2)
    order_amounts = np.clip(order_amounts, 100, 500000)  # Cap extreme outliers
    order_ids = [f"ORD{str(i).zfill(8)}" for i in range(1, n_orders + 1)]
    orders = pd.DataFrame({
        'order_id': order_ids,
        'customer_id': order_customer_ids,
        'order_date': order_dates,
        'amount': order_amounts,
        'status': statuses
    })
    return orders

orders_df = generate_orders(customers_df, 10000)
print(f"Generated {len(orders_df)} orders")
print(orders_df.head(3))
Output:
text
Generated 10000 orders
order_id customer_id order_date amount status
0 ORD00000001 CUST000342 2025-01-15 12450.50 completed
1 ORD00000002 CUST000891 2024-08-22 3280.75 completed
2 ORD00000003 CUST000156 2025-11-03 8920.00 pending

Step 4: Generate Invoices with Business Rules

Invoices add another layer of complexity. They must link to valid orders, have amounts that roughly match (allowing for discounts or tax adjustments), and have due dates that follow net-30 terms.

python
def generate_invoices(orders_df, invoice_rate=0.8):
    """Generate invoices linked to orders with realistic payment statuses and amount tolerance."""
    # Not all orders generate invoices (cancelled orders often don't)
    eligible_orders = orders_df[orders_df['status'] != 'cancelled']
    invoice_orders = eligible_orders.sample(frac=invoice_rate, random_state=42)
    invoice_order_ids = invoice_orders['order_id'].values
    # Index once so each lookup is O(1) instead of a full-table scan
    orders_indexed = orders_df.set_index('order_id')
    # Link invoice amounts to order amounts with small variations (-5% discount to +10% tax)
    invoice_amounts = []
    for order_id in invoice_order_ids:
        order_amount = orders_indexed.at[order_id, 'amount']
        variation = np.random.uniform(-0.05, 0.10)  # -5% to +10%
        invoice_amount = round(order_amount * (1 + variation), 2)
        invoice_amounts.append(invoice_amount)
    # Invoice and due dates
    invoice_dates = []
    due_dates = []
    for order_id in invoice_order_ids:
        order_date = orders_indexed.at[order_id, 'order_date']
        invoice_date = order_date + timedelta(days=np.random.randint(1, 8))
        due_date = invoice_date + timedelta(days=30)  # Net-30 terms
        invoice_dates.append(invoice_date)
        due_dates.append(due_date)
    # Payment status: 85% paid, 10% overdue, 5% disputed
    payment_statuses = np.random.choice(
        ['paid', 'overdue', 'disputed'],
        size=len(invoice_order_ids),
        p=[0.85, 0.10, 0.05]
    )
    invoice_ids = [f"INV{str(i).zfill(7)}" for i in range(1, len(invoice_order_ids) + 1)]
    invoices = pd.DataFrame({
        'invoice_id': invoice_ids,
        'order_id': invoice_order_ids,
        'invoice_date': invoice_dates,
        'due_date': due_dates,
        'amount': invoice_amounts,
        'payment_status': payment_statuses
    })
    return invoices

invoices_df = generate_invoices(orders_df, invoice_rate=0.8)
print(f"Generated {len(invoices_df)} invoices")
print(invoices_df.head(3))
Output:
text
Generated 8000 invoices
invoice_id order_id invoice_date due_date amount payment_status
0 INV0000001 ORD00000234 2025-01-20 2025-02-19 13205.25 paid
1 INV0000002 ORD00000567 2024-08-28 2024-09-27 3444.79 paid
2 INV0000003 ORD00000891 2025-11-09 2025-12-09 9366.00 overdue

Step 5: Validate Referential Integrity (The Step Most Skip)

Generating the data is only half the battle. You must validate it before trusting your test results. Here are the exact checks I run:

python
def validate_referential_integrity(customers_df, orders_df, invoices_df):
    """Run validation queries to catch referential breaks and temporal violations."""
    # Check 1: Orphaned orders (customer_id not in customers table)
    orphaned_orders = orders_df[
        ~orders_df['customer_id'].isin(customers_df['customer_id'])
    ]
    # Check 2: Orphaned invoices (order_id not in orders table)
    orphaned_invoices = invoices_df[
        ~invoices_df['order_id'].isin(orders_df['order_id'])
    ]
    # Check 3: Temporal violations (order before customer registration)
    # Merging for vectorized check instead of slow iteration
    merged = orders_df.merge(
        customers_df[['customer_id', 'registration_date']],
        on='customer_id', how='left'
    )
    temporal_violations = merged[merged['order_date'] < merged['registration_date']]
    # Check 4: Amount tolerance violations (invoice amount drifts >15% from order)
    invoice_order_merge = invoices_df.merge(
        orders_df[['order_id', 'amount']],
        on='order_id', how='left', suffixes=('_inv', '_ord')
    )
    invoice_order_merge['pct_drift'] = (
        abs(invoice_order_merge['amount_inv'] - invoice_order_merge['amount_ord'])
        / invoice_order_merge['amount_ord']
    )
    amount_violations = invoice_order_merge[invoice_order_merge['pct_drift'] > 0.15]
    print("=" * 60)
    print("REFERENTIAL INTEGRITY VALIDATION REPORT")
    print("=" * 60)
    print(f"Orphaned orders: {len(orphaned_orders)}")
    print(f"Orphaned invoices: {len(orphaned_invoices)}")
    print(f"Temporal violations: {len(temporal_violations)}")
    print(f"Amount tolerance violations (>15%): {len(amount_violations)}")
    print("=" * 60)
    if len(orphaned_orders) == 0 and len(orphaned_invoices) == 0 and \
       len(temporal_violations) == 0 and len(amount_violations) == 0:
        print("✓ All checks passed. Database is relationally sound.")
    else:
        print("✗ Validation failed. Do not use for production testing.")
    return {
        'orphaned_orders': orphaned_orders,
        'orphaned_invoices': orphaned_invoices,
        'temporal_violations': temporal_violations,
        'amount_violations': amount_violations
    }

validation_results = validate_referential_integrity(customers_df, orders_df, invoices_df)

Output:

text
============================================================
REFERENTIAL INTEGRITY VALIDATION REPORT
============================================================
Orphaned orders: 0
Orphaned invoices: 0
Temporal violations: 0
Amount tolerance violations (>15%): 0
============================================================
✓ All checks passed. Database is relationally sound.
Step 6: Export to SQLite for Realistic Query Testing

DataFrames are great, but your AI pipeline runs on SQL. Export to SQLite to test your actual feature engineering queries:

python
import sqlite3

conn = sqlite3.connect('synthetic_test_db.sqlite')
customers_df.to_sql('customers', conn, index=False, if_exists='replace')
orders_df.to_sql('orders', conn, index=False, if_exists='replace')
invoices_df.to_sql('invoices', conn, index=False, if_exists='replace')

# Test a complex join that mimics production feature engineering
query = """
SELECT
    c.segment,
    COUNT(o.order_id) AS total_orders,
    SUM(o.amount) AS total_revenue,
    AVG(i.amount) AS avg_invoice_amount,
    SUM(CASE WHEN i.payment_status = 'paid' THEN 1 ELSE 0 END) AS paid_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN invoices i ON o.order_id = i.order_id
GROUP BY c.segment
"""
result = pd.read_sql_query(query, conn)
print("\nRevenue by Segment (Test Query):")
print(result)
conn.close()

Output:

text
Revenue by Segment (Test Query):
segment total_orders total_revenue avg_invoice_amount paid_count
0 enterprise 1245 18500000.50 15200.75 1058
1 retail 5832 9200000.25 4800.50 4957
2 sme 2923 7800000.75 6500.25 2485
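One caveat worth knowing: to_sql recreates the tables without any constraints, and SQLite only enforces foreign keys when the foreign_keys pragma is switched on for the connection. If you want the database itself to reject orphans, declare the schema first. The sketch below uses a minimal hypothetical two-table schema to show the mechanics:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FK constraints unless enabled
conn.execute("""
    CREATE TABLE customers (
        customer_id TEXT PRIMARY KEY,
        name TEXT
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id TEXT PRIMARY KEY,
        customer_id TEXT NOT NULL REFERENCES customers(customer_id),
        amount REAL
    )
""")
conn.execute("INSERT INTO customers VALUES ('CUST000001', 'Test User')")
conn.execute("INSERT INTO orders VALUES ('ORD00000001', 'CUST000001', 100.0)")  # valid FK

# An orphaned order is now rejected at the database layer, not just by our script
try:
    conn.execute("INSERT INTO orders VALUES ('ORD00000002', 'CUST_MISSING', 50.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

With the schema declared up front, you would load the DataFrames with `if_exists='append'` instead of `'replace'` so the constraints survive the load.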

Where This Approach Breaks (And When to Consider Alternatives)

This manual script worked well for what we built: 3 tables, ~20k rows, simple business rules. I’ve used variations of this code for quick prototypes and hackathons.

But this approach tends to fall over under production load:

Schema complexity: once I had to model 15+ interrelated tables, hand-writing a generation function per table became unmanageable.

Volume: generating 10,000,000 rows with row-by-row loops like the ones above takes hours. Vectorizing the amount and status columns sped things up considerably, but the per-row foreign key and registration date lookups became the new bottleneck.
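The per-row lookup can be eliminated entirely. The sketch below (hypothetical frames standing in for customers_df, using NumPy's Generator API) samples foreign keys positionally and enforces temporal consistency in one pass of timedelta arithmetic, with no Python loop:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical small customers frame standing in for customers_df
customers = pd.DataFrame({
    "customer_id": [f"CUST{i:06d}" for i in range(1, 1001)],
    "registration_date": pd.to_datetime("2023-01-01")
        + pd.to_timedelta(rng.integers(0, 1000, size=1000), unit="D"),
})

n_orders = 100_000
# Vectorized foreign key sampling: positional draw, referential integrity by construction
cust_idx = rng.integers(0, len(customers), size=n_orders)
orders = customers.iloc[cust_idx].reset_index(drop=True)

# Vectorized temporal consistency: random offset between registration and window end
end = pd.Timestamp("2026-03-30")
max_days = (end - orders["registration_date"]).dt.days.clip(lower=1)
offsets = (rng.random(n_orders) * max_days).astype(int)
orders["order_date"] = orders["registration_date"] + pd.to_timedelta(offsets, unit="D")
```

Because every order row is copied from a real customer row, orphans are impossible, and the date arithmetic guarantees order_date never precedes registration_date.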

Edge case injection: hard-coding specific cases to occur at a certain rate (say, 0.1%) makes the filtering and resampling code very brittle.
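One way to keep that less brittle is to make the rate a parameter of a small injection helper rather than scattering constants through the generators. The inject_orphans function below is a hypothetical sketch for deliberately breaking referential integrity, which is useful for negative-testing the validation layer from Step 5:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

def inject_orphans(orders: pd.DataFrame, rate: float = 0.001) -> pd.DataFrame:
    """Return a copy where `rate` of rows point at a nonexistent customer.

    The rate is a parameter, not a hard-coded constant, so the same helper
    yields clean data (rate=0.0) or deliberately broken data for negative tests.
    """
    out = orders.copy()
    n_bad = int(len(out) * rate)
    if n_bad > 0:
        bad_idx = rng.choice(out.index, size=n_bad, replace=False)
        out.loc[bad_idx, "customer_id"] = "CUST_MISSING"  # guaranteed orphan
    return out

# Hypothetical orders frame standing in for orders_df
orders = pd.DataFrame({
    "order_id": [f"ORD{i:08d}" for i in range(1, 10001)],
    "customer_id": [f"CUST{(i % 1000) + 1:06d}" for i in range(10000)],
})
broken = inject_orphans(orders, rate=0.001)  # exactly 10 orphaned rows
```

Running the Step 5 validator on `broken` should report exactly the injected number of orphaned orders; if it reports zero, the validator itself is broken.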

Regeneration: when your schema changes (and it will), you have to rewrite generation functions instead of just updating a config.
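A config-driven generator is one way out. The sketch below assumes a hypothetical SCHEMA dict that maps each column to a generator spec, so adding a column or table becomes a config edit rather than new code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical declarative schema: each column maps to a generator spec
SCHEMA = {
    "customers": {
        "rows": 100,
        "columns": {
            "customer_id": {"kind": "sequence", "prefix": "CUST"},
            "segment": {"kind": "choice",
                        "values": ["retail", "sme", "enterprise"],
                        "p": [0.6, 0.3, 0.1]},
            "amount": {"kind": "lognormal", "mean": 3.5, "sigma": 1.2},
        },
    },
}

def build_column(spec, n):
    """Dispatch on the spec's 'kind' to produce one column of n values."""
    if spec["kind"] == "sequence":
        return [f'{spec["prefix"]}{i:06d}' for i in range(1, n + 1)]
    if spec["kind"] == "choice":
        return rng.choice(spec["values"], size=n, p=spec["p"])
    if spec["kind"] == "lognormal":
        return np.round(rng.lognormal(spec["mean"], spec["sigma"], size=n), 2)
    raise ValueError(f"unknown kind: {spec['kind']}")

def build_table(name):
    cfg = SCHEMA[name]
    return pd.DataFrame({col: build_column(s, cfg["rows"])
                         for col, s in cfg["columns"].items()})

customers = build_table("customers")
```

Foreign keys and temporal rules would need their own spec kinds, but the shape of the solution stays the same: the schema lives in data, not in code.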

Validation depth: production audits go much deeper than KS tests and correlation matrices, and strong privacy guarantees are a high priority for us.
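For reference, the KS check itself is only a few lines. This sketch implements the two-sample statistic directly in NumPy (scipy.stats.ks_2samp would give the same number plus a p-value) and compares a matching synthetic distribution against a deliberately shifted one:

```python
import numpy as np

rng = np.random.default_rng(1)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    # ECDF evaluated at every sample point of both samples
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

real = rng.lognormal(3.5, 1.2, size=5000)       # stand-in for production amounts
synthetic = rng.lognormal(3.5, 1.2, size=5000)  # generated with matching parameters
shifted = rng.lognormal(4.5, 1.2, size=5000)    # deliberately wrong distribution

d_good = ks_statistic(real, synthetic)
d_bad = ks_statistic(real, shifted)
```

A small d_good and large d_bad is what you want to see; it confirms the check can actually distinguish a matching distribution from a wrong one.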

Manual generation is fine for small-scale testing, learning, proofs of concept, and prototyping. For production enterprise AI systems dealing with complex schemas, rapid iteration cycles, and compliance requirements, the maintenance cost generally outweighs the marginal benefit of hand-rolled generation code.

I’ve found I spend about 20% of each sprint just massaging the scripts we use to load-test our databases. A dedicated synthetic data tool might have saved me much of that time.

The Bottom Line

Are you still testing your AI models on flat CSV files?

Your models are only as good as your test data. A relational test database with foreign keys, temporal consistency, and your real business rules is the closest you can get to production behavior without touching production.

This code is a starting point: adapt it, extend it, and add validation for your own schema. Ideally, your pipeline breaks against fake data in testing, not in the same place in production where you never tested it.

Tags: Machine Learning, Data Engineering, Python, Synthetic Data, Databases


Building a Realistic Relational Test Database from Scratch with Python was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
