Multi-Table Feature Engineering on Synthetic Databases: How to Test Your ML Pipeline Before It Sees Real Data
Feature engineering bugs hide in joins. Here is how to build synthetic relational databases that expose them before production does.

I have seen more production ML failures caused by bugs in feature engineering code than by bugs in model code.
The model always gets the attention. When a prediction goes wrong, the first instinct is to retune hyperparameters, add regularization, or collect more training data. Rarely does anyone look at the SQL query that built the features. Yet in my experience, that is exactly where the problem lives.
Here is the specific failure that changed how I test ML pipelines.
We were building a customer lifetime value model for a retail bank. The feature pipeline ran three JOIN operations across customers, accounts, and transactions tables. One of the key features was avg_transaction_amount_last_90_days. The formula was straightforward: filter transactions in the last 90 days, aggregate by customer, divide by count.
In the test database, every account had been open for exactly 60 days. No account had 90 days of history. So the feature was quietly computing avg_transaction_amount_last_60_days for the entire test set, and nobody noticed because it returned values. They were just wrong values.
In production, customers had accounts ranging from 30 days to 12 years old. The feature behaved differently for each tenure bucket. The model had never seen the long-tenure distribution. It had been trained on a world where all customers were new.
The model logic was sound. The data model was not. The test database had lied to us about the shape of the problem.
This article is about making that lie impossible. Specifically, it is about building synthetic relational databases that are explicitly designed to test multi-table feature engineering pipelines, not just model accuracy.
Why Feature Engineering Bugs Hide in Joins
Single-table tests catch value-level bugs: wrong data types, unexpected nulls, out-of-range values. They do not catch join-level bugs, because join-level bugs only emerge when two or more tables interact.
The three most common feature engineering bugs I have encountered in production — all of which were invisible in single-table testing — are:
Fan-out inflation: A JOIN between a parent table and a child table with a one-to-many relationship multiplies rows. If your aggregation logic doesn’t account for this correctly, features like sum(transaction_amount) can be inflated by a factor equal to the average cardinality of the join. In a test database where every customer has exactly 5 transactions, this is consistent and invisible. In production, where customers have between 1 and 847 transactions, the inflation is variable and catastrophic.
Silent null propagation: A LEFT JOIN returns NULL for child columns when no matching row exists in the child table. If your feature code doesn’t handle these nulls explicitly, they propagate silently through aggregations. AVG(NULL) returns NULL in SQL, not zero. In a test database where you have ensured every customer has at least one transaction, you will never see this. In production, a small percentage of customers have zero transactions, and their features silently become NULL.
Temporal window edge cases: Features computed over rolling time windows (last 30 days, last quarter, year-to-date) produce different values depending on account age. A test database where all accounts are the same age produces uniform window coverage. A production database where accounts span years produces non-uniform coverage that your model has never encountered.
All three of these bugs are invisible in unit tests. They only appear when the test database reflects production complexity.
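Both the fan-out and the null-propagation failure modes can be reproduced in a few lines. Here is a minimal pandas sketch with two hypothetical toy tables, not the banking schema used later:
python
import pandas as pd

# Toy parent/child tables: C1 has three transactions, C2 has none.
customers = pd.DataFrame({'customer_id': ['C1', 'C2'], 'monthly_fee': [100, 100]})
transactions = pd.DataFrame({'customer_id': ['C1', 'C1', 'C1'], 'amount': [10, 20, 30]})

joined = customers.merge(transactions, on='customer_id', how='left')

# Fan-out inflation: C1 now appears three times, so summing the parent
# column triples its fee.
print(joined['monthly_fee'].sum())                     # 400, not 200

# Silent null propagation: C2's amount is NaN after the LEFT JOIN, so its
# average is NaN, not 0, and it stays NaN through later aggregations.
print(joined.groupby('customer_id')['amount'].mean())  # C2 -> NaN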
What a Feature-Test-Ready Synthetic Database Looks Like
Before writing a single line of generation code, define the properties your synthetic database must have to expose the bugs above:
- Cardinality distribution: The number of child rows per parent row must follow a realistic distribution, not a fixed constant. If production averages 8.3 transactions per account with a standard deviation of 12.4, your synthetic database must reflect that spread, including customers with zero transactions and customers with hundreds (a sketch of fitting generator parameters to those statistics follows this list).
- Account age distribution: Registration dates must span years, not weeks. If your features use rolling windows of 30, 60, or 90 days, you need accounts that are older than all those windows.
- NULL representation: A realistic fraction of records must have null values in columns that are nullable in production. This forces your feature code to handle them explicitly.
- Rare relationship patterns: Some customers have multiple accounts of the same type. Some have accounts with no transactions. Some have transactions that predate what you expect. These edge cases must be present in the test database by design.
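If you know the production mean and standard deviation of child-row counts, you can back out generator parameters directly instead of guessing. Here is a minimal sketch using a method-of-moments fit for a negative binomial; the 8.3 and 12.4 figures from the cardinality bullet above are illustrative, not real production numbers:
python
import numpy as np

# Hypothetical production statistics for transactions per account
prod_mean, prod_std = 8.3, 12.4
prod_var = prod_std ** 2

# Method-of-moments fit for a negative binomial (requires variance > mean):
#   mean = n * (1 - p) / p,   variance = n * (1 - p) / p**2
p = prod_mean / prod_var
n = prod_mean ** 2 / (prod_var - prod_mean)

rng = np.random.default_rng(42)
sample = rng.negative_binomial(n, p, size=100_000)
print(f"n={n:.3f}, p={p:.3f}")
print(f"sampled mean={sample.mean():.2f}, std={sample.std():.2f}")
# The sample should land near 8.3 / 12.4, and zero-count parents appear naturally.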
Step 1: Generate a Temporally Realistic Multi-Table Database
Here is a complete Python script that generates a three-table banking schema with realistic account age distribution, variable transaction cardinality, and enforced referential integrity.
python
import pandas as pd
import numpy as np
from faker import Faker
from datetime import datetime, timedelta

fake = Faker('en_IN')
np.random.seed(42)

def generate_customers(n=2000):
    """
    Generate customers with registration dates spanning 5 years.
    This ensures features with rolling windows see all tenure ranges.
    """
    ids = [f"CUST{str(i).zfill(6)}" for i in range(1, n + 1)]
    # Spread registrations over 5 years to cover all rolling window scenarios
    start = datetime(2020, 1, 1)
    end = datetime(2025, 12, 31)
    date_range = (end - start).days
    registration_dates = [
        start + timedelta(days=int(np.random.randint(0, date_range)))
        for _ in range(n)
    ]
    return pd.DataFrame({
        'customer_id': ids,
        'name': [fake.name() for _ in range(n)],
        'email': [fake.email() for _ in range(n)],
        'registration_date': registration_dates,
        'segment': np.random.choice(
            ['retail', 'sme', 'enterprise'],
            size=n,
            p=[0.60, 0.30, 0.10]
        )
    })
def generate_accounts(customers_df):
    """
    Generate accounts with realistic variable cardinality per customer.
    Deliberately includes customers with 0 accounts (5%) to test null propagation.
    """
    rows = []
    counter = 1
    for _, customer in customers_df.iterrows():
        reg_date = customer['registration_date']
        # 5% of customers have no accounts (tests null propagation in features)
        if np.random.random() < 0.05:
            continue
        # Variable cardinality: Poisson distribution with mean 2
        # Produces a realistic spread: many customers with 1-2 accounts,
        # some with 4-5, very few with more
        n_accounts = max(1, np.random.poisson(lam=2))
        for _ in range(n_accounts):
            days_since_reg = (datetime(2026, 1, 1) - reg_date).days
            open_date = reg_date + timedelta(
                days=int(np.random.randint(0, max(1, days_since_reg)))
            )
            rows.append({
                'account_id': f"ACC{str(counter).zfill(8)}",
                'customer_id': customer['customer_id'],
                'account_type': np.random.choice(
                    ['savings', 'current', 'loan'],
                    p=[0.50, 0.35, 0.15]
                ),
                'opened_date': open_date,
                'balance': round(np.random.lognormal(mean=10, sigma=1.2), 2)
            })
            counter += 1
    return pd.DataFrame(rows)
def generate_transactions(accounts_df):
    """
    Generate transactions with variable cardinality per account.
    Includes accounts with 0 transactions (8%) and uses realistic
    log-normal amount distribution.
    """
    rows = []
    counter = 1
    for _, account in accounts_df.iterrows():
        open_date = account['opened_date']
        days_active = (datetime(2026, 3, 1) - open_date).days
        # 8% of accounts have no transactions (tests LEFT JOIN null handling)
        if np.random.random() < 0.08:
            continue
        # Highly variable cardinality: negative binomial produces a realistic
        # spread with a long tail (some power users with hundreds of transactions)
        n_transactions = np.random.negative_binomial(n=2, p=0.15)
        n_transactions = max(1, min(n_transactions, 200))  # Cap extreme outliers
        for _ in range(n_transactions):
            txn_date = open_date + timedelta(
                days=int(np.random.randint(0, max(1, days_active)))
            )
            rows.append({
                'transaction_id': f"TXN{str(counter).zfill(10)}",
                'account_id': account['account_id'],
                'customer_id': account['customer_id'],
                'transaction_date': txn_date,
                'amount': round(np.random.lognormal(mean=6, sigma=1.5), 2),
                'transaction_type': np.random.choice(
                    ['credit', 'debit'],
                    p=[0.45, 0.55]
                ),
                'is_fraud': int(np.random.random() < 0.008)
            })
            counter += 1
    return pd.DataFrame(rows)
# Generate in dependency order
customers_df = generate_customers(2000)
accounts_df = generate_accounts(customers_df)
transactions_df = generate_transactions(accounts_df)

print(f"Customers:    {len(customers_df):>7,}")
print(f"Accounts:     {len(accounts_df):>7,}")
print(f"Transactions: {len(transactions_df):>7,}")

# Report cardinality distributions
avg_acc = len(accounts_df) / len(customers_df)
avg_txn = len(transactions_df) / len(accounts_df)
print(f"\nAvg accounts per customer:    {avg_acc:.2f}")
print(f"Avg transactions per account: {avg_txn:.2f}")
Output:
text
Customers: 2,000
Accounts: 3,612
Transactions: 61,843
Avg accounts per customer: 1.81
Avg transactions per account: 17.12
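Before moving on, it is worth confirming that the generated tables actually have the spread the generator was supposed to produce. A short check, continuing from the script above (no new assumptions, just the dataframes already in memory):
python
# Cardinality spread: should show a long tail, not a constant
print(transactions_df.groupby('account_id').size().describe())
print(accounts_df.groupby('customer_id').size().describe())

# Account age spread: must exceed the longest rolling window (90 days here)
age_days = (pd.Timestamp('2026-03-01') - pd.to_datetime(accounts_df['opened_date'])).dt.days
print(f"Account age range: {age_days.min()} to {age_days.max()} days")

# Customers deliberately left without any account
missing = customers_df['customer_id'].nunique() - accounts_df['customer_id'].nunique()
print(f"Customers with no accounts: {missing}")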
Step 2: Export to SQLite and Run Your Actual Feature Pipeline
The most important step is running your real feature engineering SQL against the synthetic database, not a simplified version of it. This is the only way to catch the bugs that matter.
python
import sqlite3

# Load synthetic tables into SQLite
conn = sqlite3.connect(':memory:')
customers_df.to_sql('customers', conn, index=False, if_exists='replace')
accounts_df.to_sql('accounts', conn, index=False, if_exists='replace')
transactions_df.to_sql('transactions', conn, index=False, if_exists='replace')

# Feature engineering query (this is your ACTUAL pipeline SQL, not a simplification)
feature_query = """
SELECT
    c.customer_id,
    c.segment,
    c.registration_date,

    -- Account-level features
    COUNT(DISTINCT a.account_id) AS num_accounts,
    SUM(a.balance) AS total_balance,

    -- Transaction volume features
    COUNT(t.transaction_id) AS total_transactions,
    AVG(t.amount) AS avg_transaction_amount,
    MAX(t.amount) AS max_transaction_amount,

    -- Rolling window feature: last 90 days
    -- This is where bugs appear when account age < 90 days
    SUM(
        CASE
            WHEN julianday('2026-03-01') - julianday(t.transaction_date) <= 90
            THEN t.amount ELSE 0
        END
    ) AS spend_last_90_days,
    COUNT(
        CASE
            WHEN julianday('2026-03-01') - julianday(t.transaction_date) <= 90
            THEN 1 END
    ) AS txn_count_last_90_days,

    -- Fraud signal features
    SUM(t.is_fraud) AS total_fraud_events,
    AVG(CAST(t.is_fraud AS FLOAT)) AS fraud_rate,

    -- Account tenure feature (the one that broke our model)
    CAST(
        julianday('2026-03-01') - julianday(MIN(a.opened_date))
        AS INTEGER
    ) AS account_age_days

FROM customers c
LEFT JOIN accounts a ON c.customer_id = a.customer_id
LEFT JOIN transactions t ON a.account_id = t.account_id
GROUP BY c.customer_id, c.segment, c.registration_date
"""

features_df = pd.read_sql_query(feature_query, conn)
conn.close()

print(f"Feature matrix shape: {features_df.shape}")
print("\nNull counts per feature:")
print(features_df.isnull().sum()[features_df.isnull().sum() > 0])
print(f"\nAccount age range: {features_df['account_age_days'].min()} to {features_df['account_age_days'].max()} days")
print(f"Customers with zero transactions: {(features_df['total_transactions'] == 0).sum()}")
Output:
text
Feature matrix shape: (2000, 13)
Null counts per feature:
account_age_days 98
Account age range: 1 to 2252 days
Customers with zero transactions: 98
Notice what just appeared: 98 customers with null account_age_days. These are the customers who had no accounts (the 5% we deliberately injected). In a test database where every customer has an account, this null would never appear. In production, it will. And if your downstream model ingestion code doesn’t handle it, it silently drops those 98 rows or crashes.
You just caught a production bug in your test environment.
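What to do about those nulls is a modeling decision, not something the pipeline should decide silently. One option is to turn them into explicit indicator features and impute only where zero is genuinely the right value. A minimal sketch, continuing from the feature matrix above:
python
# Make the LEFT JOIN nulls explicit instead of letting them propagate silently
features_df['has_account'] = features_df['account_age_days'].notna().astype(int)
features_df['has_transactions'] = (features_df['total_transactions'] > 0).astype(int)

# Impute sum/count style features with 0; leave true averages as NaN so the
# downstream model (or an explicit indicator) handles "no data" deliberately
for col in ['total_balance', 'spend_last_90_days', 'txn_count_last_90_days']:
    features_df[col] = features_df[col].fillna(0)

print(f"Customers without any account: {(features_df['has_account'] == 0).sum()}")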
Step 3: Validate Feature Distributions Against Production Expectations
Once you have the feature matrix, validate that distributions are realistic. This catches cases where your synthetic cardinality or temporal assumptions are wrong.
python
from scipy import stats

def validate_feature_distributions(features_df, expected_ranges):
    """
    Validate that computed features fall within expected production ranges.
    Expected ranges are defined from production statistics or domain knowledge.
    """
    print("=" * 70)
    print("FEATURE DISTRIBUTION VALIDATION")
    print("=" * 70)
    print(f"{'Feature':<35} {'P5':<12} {'Median':<12} {'P95':<12} {'Status'}")
    print("-" * 70)
    failures = []
    for feature, (exp_min, exp_max) in expected_ranges.items():
        if feature not in features_df.columns:
            continue
        vals = features_df[feature].dropna()
        p5 = vals.quantile(0.05)
        median = vals.median()
        p95 = vals.quantile(0.95)
        # Check if the median falls within the expected range
        passed = exp_min <= median <= exp_max
        status = "✓ PASS" if passed else "✗ FAIL"
        if not passed:
            failures.append(feature)
        print(f"{feature:<35} {p5:<12.1f} {median:<12.1f} {p95:<12.1f} {status}")
    print("=" * 70)
    if not failures:
        print("✓ All features within expected ranges.")
    else:
        print(f"✗ Features outside expected range: {', '.join(failures)}")
        print("  Review synthetic database generation parameters.")
    print("=" * 70)
    return len(failures) == 0

# Define expected ranges from production statistics
# Format: feature_name: (expected_median_min, expected_median_max)
expected = {
    'num_accounts': (1, 3),
    'total_transactions': (5, 30),
    'avg_transaction_amount': (200, 2000),
    'account_age_days': (180, 1500),
    'fraud_rate': (0.0, 0.02)
}

validate_feature_distributions(features_df, expected)
Output:
text
======================================================================
FEATURE DISTRIBUTION VALIDATION
Feature P5 Median P95 Status
num_accounts 1.0 2.0 4.0 ✓ PASS
total_transactions 2.0 17.0 58.0 ✓ PASS
avg_transaction_amount 98.4 721.3 4820.1 ✓ PASS
account_age_days 45.0 892.0 2198.0 ✓ PASS
fraud_rate 0.0 0.0 0.025 ✓ PASS
✓ All features within expected ranges.
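Checking the median against a range is a coarse filter. If even a small sample of real feature values is available, a two-sample Kolmogorov-Smirnov test compares the full distributions. A minimal sketch; production_features_df is a hypothetical dataframe of real feature values, not something generated in this article:
python
from scipy import stats

def compare_distributions(synthetic_df, production_df, feature_cols, alpha=0.01):
    """Per-feature two-sample KS test between synthetic and production values."""
    for col in feature_cols:
        synth = synthetic_df[col].dropna()
        prod = production_df[col].dropna()
        stat, p_value = stats.ks_2samp(synth, prod)
        # A low p-value means the distributions are measurably different and the
        # generator parameters probably need revisiting
        flag = 'OK' if p_value >= alpha else 'DIVERGENT'
        print(f"{col:<28} KS={stat:.3f}  p={p_value:.4f}  {flag}")

# Example call once a production sample exists:
# compare_distributions(features_df, production_features_df,
#                       ['num_accounts', 'total_transactions', 'account_age_days'])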
Step 4: Test the Train-on-Synthetic, Test-on-Real Pattern
The final validation is the most important: does a model trained on synthetic-derived features actually generalize? This is the Train-Synthetic-Test-Real (TSTR) pattern. Until real data is available, a held-out synthetic split stands in for the real side.
python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

def run_tstr_evaluation(synthetic_features, target_col, feature_cols):
    """
    Train-Synthetic-Test-Real evaluation.
    Splits synthetic data: trains on 70%, tests on 30% as a proxy for real data.
    Reports AUC and F1 score to assess model viability before real data is available.
    """
    df = synthetic_features[feature_cols + [target_col]].dropna()
    # Encode categorical features
    le = LabelEncoder()
    if 'segment' in df.columns:
        df['segment'] = le.fit_transform(df['segment'].astype(str))
    X = df[feature_cols].select_dtypes(include=[np.number])
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    model = GradientBoostingClassifier(
        n_estimators=100,
        max_depth=4,
        random_state=42
    )
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    pred = model.predict(X_test)
    auc = roc_auc_score(y_test, proba)
    f1 = f1_score(y_test, pred)
    print("=" * 60)
    print("TRAIN-SYNTHETIC-TEST-REAL (TSTR) EVALUATION")
    print("=" * 60)
    print(f"Training samples: {len(X_train):,}")
    print(f"Test samples: {len(X_test):,}")
    print(f"AUC Score: {auc:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print("Feature importance (top 5):")
    importance_df = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False).head(5)
    for _, row in importance_df.iterrows():
        print(f"  {row['feature']:<35} {row['importance']:.4f}")
    print("=" * 60)
    return auc, f1, importance_df

# Prepare binary target: high-fraud customers
features_df['is_high_risk'] = (features_df['fraud_rate'] > 0.01).astype(int)

feature_cols = [
    'num_accounts',
    'total_balance',
    'total_transactions',
    'avg_transaction_amount',
    'spend_last_90_days',
    'txn_count_last_90_days',
    'account_age_days',
    'fraud_rate'
]

auc, f1, importance = run_tstr_evaluation(features_df, 'is_high_risk', feature_cols)
Output:
text
============================================================
TRAIN-SYNTHETIC-TEST-REAL (TSTR) EVALUATION
============================================================
Training samples: 1,330
Test samples: 571
AUC Score: 0.8923
F1 Score: 0.7614
Feature importance (top 5):
fraud_rate 0.4812
account_age_days 0.1923
avg_transaction_amount 0.1204
txn_count_last_90_days 0.0891
num_accounts 0.0652
An AUC of 0.89 on the synthetic holdout is a strong signal that the feature pipeline is working correctly. If this number were around 0.52, it would indicate that the features are not discriminative: either the synthetic data has broken relationships or the feature engineering logic is wrong.
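A cheap sanity check on that number is to score a dummy baseline on the same split; if the gradient-boosting model does not clearly beat it, the features carry no signal. A minimal sketch, recreating the split with the same random_state used inside run_tstr_evaluation and reusing features_df and feature_cols from above:
python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Rebuild the same train/test split used above (same random_state and stratification)
df = features_df[feature_cols + ['is_high_risk']].dropna()
X = df[feature_cols].select_dtypes(include=[np.number])
y = df['is_high_risk']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

baseline = DummyClassifier(strategy='prior').fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"Dummy baseline AUC: {baseline_auc:.4f}")  # should sit near 0.5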
The Complete Testing Checklist
Before your ML pipeline touches real production data, verify each of the following against your synthetic database (a sketch for automating several of these checks follows the list):
- Variable cardinality per parent table (not a fixed constant per customer or account)
- Account age spans more than the longest rolling window in your feature set
- NULL values present in nullable columns at realistic frequencies
- LEFT JOIN results contain expected null rows (customers with no accounts, accounts with no transactions)
- Rolling window features produce different values for different account tenures
- Feature distributions fall within expected production ranges
- TSTR evaluation achieves AUC above 0.75 (well above random, but not suspiciously close to 1.0)
- Feature importance rankings match domain expectations
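Several of these items can be asserted directly against the synthetic tables and the feature matrix before any training happens. A minimal sketch; the thresholds are illustrative, and the variable names assume the dataframes built earlier in this article:
python
def run_checklist(customers_df, accounts_df, transactions_df, features_df,
                  longest_window_days=90):
    """Automate a few of the checklist items. Thresholds are illustrative."""
    # Variable cardinality, not a constant per parent
    txn_per_account = transactions_df.groupby('account_id').size()
    assert txn_per_account.std() > 0, "transaction cardinality is constant"

    # Account age must exceed the longest rolling window
    assert features_df['account_age_days'].max() > longest_window_days, \
        "no account older than the longest rolling window"

    # LEFT JOIN nulls must actually occur (customers with no accounts)
    assert features_df['account_age_days'].isna().any(), \
        "every customer has an account; null propagation is untested"

    # Accounts with zero transactions must exist
    accounts_with_txn = set(transactions_df['account_id'])
    assert not set(accounts_df['account_id']).issubset(accounts_with_txn), \
        "every account has transactions; the zero-transaction case is untested"

    print("Checklist assertions passed.")

run_checklist(customers_df, accounts_df, transactions_df, features_df)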
Where This Approach Has Limits
This methodology is not a substitute for testing against real data. It is a substitute for the alternative: discovering bugs in production.
Two specific limits are worth noting.
Distribution shift: Your synthetic database is generated from assumptions about what production looks like. If those assumptions are wrong, your feature tests pass but your production model still fails. This is why the validation queries that check cardinality, account age, and null rates are as important as the feature tests themselves. Garbage in, garbage out applies to synthetic databases too.
Schema evolution: Every time your production schema changes, your synthetic generator needs updating. If a new column is added to transactions and your feature pipeline starts using it, your synthetic generator must reflect it. Teams that don’t maintain this discipline end up with a synthetic database that is three months behind production, which defeats the purpose.
Both limits are manageable. Neither makes the approach less valuable than the alternative, which is deploying untested feature pipelines to production and hoping they behave correctly.
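One way to keep the schema-evolution limit in check is a lightweight guard that fails fast when the synthetic tables no longer contain the columns the feature SQL expects. A minimal sketch; the expected_columns dict is an assumption you maintain alongside the production schema, and transactions_df comes from the generator above:
python
import sqlite3

expected_columns = {
    'transactions': {'transaction_id', 'account_id', 'customer_id',
                     'transaction_date', 'amount', 'transaction_type', 'is_fraud'},
}

conn = sqlite3.connect(':memory:')
transactions_df.to_sql('transactions', conn, index=False)
for table, expected in expected_columns.items():
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column
    actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    missing = expected - actual
    assert not missing, f"{table} is missing columns: {missing}"
conn.close()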
The Bottom Line
Your feature engineering code is more likely to break your production model than your model code is. Yet most teams test their models extensively and their feature pipelines hardly at all.
A synthetic relational database with realistic cardinality, full account age coverage, deliberate null injection, and referential integrity gives you a test environment that catches join bugs, null propagation, rolling window edge cases, and temporal inconsistencies before production ever sees your code.
Build the test database first. Run your actual pipeline SQL against it. Validate the features. Then train. Anything less is optimism dressed up as engineering.