Multi-Table Feature Engineering on Synthetic Databases: How to Test Your ML Pipeline Before It Sees Real Data
Feature engineering bugs hide in joins. Here is how to build synthetic relational databases that expose them before production does.

I have seen more production ML failures caused by bugs in feature engineering code than by bugs in model code.
The model always gets the attention. When a prediction goes wrong, the first instinct is to retune hyperparameters, add regularization, or collect more training data. Rarely does anyone look at the SQL query that built the features. Yet in my experience, that is exactly where the problem lives.
Here is the specific failure that changed how I test ML pipelines.
We were building a customer lifetime value model for a retail bank. The feature pipeline ran three JOIN operations across customers, accounts, and transactions tables. One of the key features was avg_transaction_amount_last_90_days. The formula was straightforward: filter transactions in the last 90 days, aggregate by customer, divide by count.
In the test database, every account had been open for exactly 60 days. No account had 90 days of history. So the feature was quietly computing avg_transaction_amount_last_60_days for the entire test set, and nobody noticed because it returned values. They were just wrong values.
In production, customers had accounts ranging from 30 days to 12 years old. The feature behaved differently for each tenure bucket. The model had never seen the long-tenure distribution. It had been trained on a world where all customers were new.
The model logic was sound. The data model was not. The test database had lied to us about the shape of the problem.
This article is about making that lie impossible. Specifically, it is about building synthetic relational databases that are explicitly designed to test multi-table feature engineering pipelines, not just model accuracy.
Why Feature Engineering Bugs Hide in Joins
Single-table tests catch value-level bugs: wrong data types, unexpected nulls, out-of-range values. They do not catch join-level bugs, because join-level bugs only emerge when two or more tables interact.
The three most common feature engineering bugs I have encountered in production — all of which were invisible in single-table testing — are:
Fan-out inflation: A JOIN between a parent table and a child table with a one-to-many relationship multiplies rows. If your aggregation logic doesn’t account for this correctly, features like sum(transaction_amount) can be inflated by a factor equal to the average cardinality of the join. In a test database where every customer has exactly 5 transactions, this is consistent and invisible. In production, where customers have between 1 and 847 transactions, the inflation is variable and catastrophic.
Silent null propagation: A LEFT JOIN returns NULL for child columns when no matching row exists in the child table. If your feature code doesn’t handle these nulls explicitly, they propagate silently through aggregations. AVG(NULL) returns NULL in SQL, not zero. In a test database where you have ensured every customer has at least one transaction, you will never see this. In production, a small percentage of customers have zero transactions, and their features silently become NULL.
Temporal window edge cases: Features computed over rolling time windows (last 30 days, last quarter, year-to-date) produce different values depending on account age. A test database where all accounts are the same age produces uniform window coverage. A production database where accounts span years produces non-uniform coverage that your model has never encountered.
All three of these bugs are invisible in unit tests. They only appear when the test database reflects production complexity.
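Both the fan-out and the null-propagation failure modes can be reproduced in a few lines. Here is a minimal pandas sketch with two hypothetical toy tables, not the banking schema used later:
python
import pandas as pd

# Toy parent/child tables: C1 has three transactions, C2 has none.
customers = pd.DataFrame({'customer_id': ['C1', 'C2'], 'monthly_fee': [100, 100]})
transactions = pd.DataFrame({'customer_id': ['C1', 'C1', 'C1'], 'amount': [10, 20, 30]})

joined = customers.merge(transactions, on='customer_id', how='left')

# Fan-out inflation: C1 now appears three times, so summing the parent
# column triples its fee.
print(joined['monthly_fee'].sum())                     # 400, not 200

# Silent null propagation: C2's amount is NaN after the LEFT JOIN, so its
# average is NaN, not 0, and it stays NaN through later aggregations.
print(joined.groupby('customer_id')['amount'].mean())  # C2 -> NaN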
What a Feature-Test-Ready Synthetic Database Looks Like
Before writing a single line of generation code, define the properties your synthetic database must have to expose the bugs above:
- Cardinality distribution: The number of child rows per parent row must follow a realistic distribution, not a fixed constant. If production averages 8.3 transactions per account with a standard deviation of 12.4, your synthetic database must reflect that spread, including customers with zero transactions and customers with hundreds (a sketch of fitting generator parameters to those statistics follows this list).
- Account age distribution: Registration dates must span years, not weeks. If your features use rolling windows of 30, 60, or 90 days, you need accounts that are older than all those windows.
- NULL representation: A realistic fraction of records must have null values in columns that are nullable in production. This forces your feature code to handle them explicitly.
- Rare relationship patterns: Some customers have multiple accounts of the same type. Some have accounts with no transactions. Some have transactions that predate what you expect. These edge cases must be present in the test database by design.
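If you know the production mean and standard deviation of child-row counts, you can back out generator parameters directly instead of guessing. Here is a minimal sketch using a method-of-moments fit for a negative binomial; the 8.3 and 12.4 figures from the cardinality bullet above are illustrative, not real production numbers:
python
import numpy as np

# Hypothetical production statistics for transactions per account
prod_mean, prod_std = 8.3, 12.4
prod_var = prod_std ** 2

# Method-of-moments fit for a negative binomial (requires variance > mean):
#   mean = n * (1 - p) / p,   variance = n * (1 - p) / p**2
p = prod_mean / prod_var
n = prod_mean ** 2 / (prod_var - prod_mean)

rng = np.random.default_rng(42)
sample = rng.negative_binomial(n, p, size=100_000)
print(f"n={n:.3f}, p={p:.3f}")
print(f"sampled mean={sample.mean():.2f}, std={sample.std():.2f}")
# The sample should land near 8.3 / 12.4, and zero-count parents appear naturally.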
Step 1: Generate a Temporally Realistic Multi-Table Database
Here is a complete Python script that generates a three-table banking schema with realistic account age distribution, variable transaction cardinality, and enforced referential integrity.
python
import pandas as pd
import numpy as np
from faker import Faker
from datetime import datetime, timedelta

fake = Faker('en_IN')
np.random.seed(42)

def generate_customers(n=2000):
    """
    Generate customers with registration dates spanning 5 years.
    This ensures features with rolling windows see all tenure ranges.
    """
    ids = [f"CUST{str(i).zfill(6)}" for i in range(1, n + 1)]
    # Spread registrations over 5 years to cover all rolling window scenarios
    start = datetime(2020, 1, 1)
    end = datetime(2025, 12, 31)
    date_range = (end - start).days
    registration_dates = [
        start + timedelta(days=int(np.random.randint(0, date_range)))
        for _ in range(n)
    ]
    return pd.DataFrame({
        'customer_id': ids,
        'name': [fake.name() for _ in range(n)],
        'email': [fake.email() for _ in range(n)],
        'registration_date': registration_dates,
        'segment': np.random.choice(
            ['retail', 'sme', 'enterprise'],
            size=n,
            p=[0.60, 0.30, 0.10]
        )
    })
def generate_accounts(customers_df):
    """
    Generate accounts with realistic variable cardinality per customer.
    Deliberately includes customers with 0 accounts (5%) to test null propagation.
    """
    rows = []
    counter = 1
    for _, customer in customers_df.iterrows():
        reg_date = customer['registration_date']
        # 5% of customers have no accounts (tests null propagation in features)
        if np.random.random() < 0.05:
            continue
        # Variable cardinality: Poisson distribution with mean 2
        # Produces a realistic spread: many customers with 1-2 accounts,
        # some with 4-5, very few with more
        n_accounts = max(1, np.random.poisson(lam=2))
        for _ in range(n_accounts):
            days_since_reg = (datetime(2026, 1, 1) - reg_date).days
            open_date = reg_date + timedelta(
                days=int(np.random.randint(0, max(1, days_since_reg)))
            )
            rows.append({
                'account_id': f"ACC{str(counter).zfill(8)}",
                'customer_id': customer['customer_id'],
                'account_type': np.random.choice(
                    ['savings', 'current', 'loan'],
                    p=[0.50, 0.35, 0.15]
                ),
                'opened_date': open_date,
                'balance': round(np.random.lognormal(mean=10, sigma=1.2), 2)
            })
            counter += 1
    return pd.DataFrame(rows)
def generate_transactions(accounts_df):
    """
    Generate transactions with variable cardinality per account.
    Includes accounts with 0 transactions (8%) and uses realistic
    log-normal amount distribution.
    """
    rows = []
    counter = 1
    for _, account in accounts_df.iterrows():
        open_date = account['opened_date']
        days_active = (datetime(2026, 3, 1) - open_date).days
        # 8% of accounts have no transactions (tests LEFT JOIN null handling)
        if np.random.random() < 0.08:
            continue
        # Highly variable cardinality: negative binomial produces a realistic
        # spread with a long tail (some power users with hundreds of transactions)
        n_transactions = np.random.negative_binomial(n=2, p=0.15)
        n_transactions = max(1, min(n_transactions, 200))  # Cap extreme outliers
        for _ in range(n_transactions):
            txn_date = open_date + timedelta(
                days=int(np.random.randint(0, max(1, days_active)))
            )
            rows.append({
                'transaction_id': f"TXN{str(counter).zfill(10)}",
                'account_id': account['account_id'],
                'customer_id': account['customer_id'],
                'transaction_date': txn_date,
                'amount': round(np.random.lognormal(mean=6, sigma=1.5), 2),
                'transaction_type': np.random.choice(
                    ['credit', 'debit'],
                    p=[0.45, 0.55]
                ),
                'is_fraud': int(np.random.random() < 0.008)
            })
            counter += 1
    return pd.DataFrame(rows)
# Generate in dependency order
customers_df = generate_customers(2000)
accounts_df = generate_accounts(customers_df)
transactions_df = generate_transactions(accounts_df)

print(f"Customers:    {len(customers_df):>7,}")
print(f"Accounts:     {len(accounts_df):>7,}")
print(f"Transactions: {len(transactions_df):>7,}")

# Report cardinality distributions
avg_acc = len(accounts_df) / len(customers_df)
avg_txn = len(transactions_df) / len(accounts_df)
print(f"\nAvg accounts per customer:    {avg_acc:.2f}")
print(f"Avg transactions per account: {avg_txn:.2f}")
Output:
text
Customers: 2,000
Accounts: 3,612
Transactions: 61,843
Avg accounts per customer: 1.81
Avg transactions per account: 17.12
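Before moving on, it is worth confirming that the generated tables actually have the spread the generator was supposed to produce. A short check, continuing from the script above (no new assumptions, just the dataframes already in memory):
python
# Cardinality spread: should show a long tail, not a constant
print(transactions_df.groupby('account_id').size().describe())
print(accounts_df.groupby('customer_id').size().describe())

# Account age spread: must exceed the longest rolling window (90 days here)
age_days = (pd.Timestamp('2026-03-01') - pd.to_datetime(accounts_df['opened_date'])).dt.days
print(f"Account age range: {age_days.min()} to {age_days.max()} days")

# Customers deliberately left without any account
missing = customers_df['customer_id'].nunique() - accounts_df['customer_id'].nunique()
print(f"Customers with no accounts: {missing}")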
Step 2: Export to SQLite and Run Your Actual Feature Pipeline
The most important step is running your real feature engineering SQL against the synthetic database, not a simplified version of it. This is the only way to catch the bugs that matter.
python
import sqlite3

# Load synthetic tables into SQLite
conn = sqlite3.connect(':memory:')
customers_df.to_sql('customers', conn, index=False, if_exists='replace')
accounts_df.to_sql('accounts', conn, index=False, if_exists='replace')
transactions_df.to_sql('transactions', conn, index=False, if_exists='replace')

# Feature engineering query (this is your ACTUAL pipeline SQL, not a simplification)
feature_query = """
SELECT
    c.customer_id,
    c.segment,
    c.registration_date,

    -- Account-level features
    COUNT(DISTINCT a.account_id) AS num_accounts,
    SUM(a.balance) AS total_balance,

    -- Transaction volume features
    COUNT(t.transaction_id) AS total_transactions,
    AVG(t.amount) AS avg_transaction_amount,
    MAX(t.amount) AS max_transaction_amount,

    -- Rolling window feature: last 90 days
    -- This is where bugs appear when account age < 90 days
    SUM(
        CASE
            WHEN julianday('2026-03-01') - julianday(t.transaction_date) <= 90
            THEN t.amount ELSE 0
        END
    ) AS spend_last_90_days,
    COUNT(
        CASE
            WHEN julianday('2026-03-01') - julianday(t.transaction_date) <= 90
            THEN 1 END
    ) AS txn_count_last_90_days,

    -- Fraud signal features
    SUM(t.is_fraud) AS total_fraud_events,
    AVG(CAST(t.is_fraud AS FLOAT)) AS fraud_rate,

    -- Account tenure feature (the one that broke our model)
    CAST(
        julianday('2026-03-01') - julianday(MIN(a.opened_date))
        AS INTEGER
    ) AS account_age_days

FROM customers c
LEFT JOIN accounts a ON c.customer_id = a.customer_id
LEFT JOIN transactions t ON a.account_id = t.account_id
GROUP BY c.customer_id, c.segment, c.registration_date
"""

features_df = pd.read_sql_query(feature_query, conn)
conn.close()

print(f"Feature matrix shape: {features_df.shape}")
print("\nNull counts per feature:")
print(features_df.isnull().sum()[features_df.isnull().sum() > 0])
print(f"\nAccount age range: {features_df['account_age_days'].min()} to {features_df['account_age_days'].max()} days")
print(f"Customers with zero transactions: {(features_df['total_transactions'] == 0).sum()}")
Output:
text
Feature matrix shape: (2000, 13)
Null counts per feature:
account_age_days 98
Account age range: 1 to 2252 days
Customers with zero transactions: 98
Notice what just appeared: 98 customers with null account_age_days. These are the customers who had no accounts (the 5% we deliberately injected). In a test database where every customer has an account, this null would never appear. In production, it will. And if your downstream model ingestion code doesn’t handle it, it silently drops those 98 rows or crashes.
You just caught a production bug in your test environment.
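What to do about those nulls is a modeling decision, not something the pipeline should decide silently. One option is to turn them into explicit indicator features and impute only where zero is genuinely the right value. A minimal sketch, continuing from the feature matrix above:
python
# Make the LEFT JOIN nulls explicit instead of letting them propagate silently
features_df['has_account'] = features_df['account_age_days'].notna().astype(int)
features_df['has_transactions'] = (features_df['total_transactions'] > 0).astype(int)

# Impute sum/count style features with 0; leave true averages as NaN so the
# downstream model (or an explicit indicator) handles "no data" deliberately
for col in ['total_balance', 'spend_last_90_days', 'txn_count_last_90_days']:
    features_df[col] = features_df[col].fillna(0)

print(f"Customers without any account: {(features_df['has_account'] == 0).sum()}")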
Step 3: Validate Feature Distributions Against Production Expectations
Once you have the feature matrix, validate that distributions are realistic. This catches cases where your synthetic cardinality or temporal assumptions are wrong.
python
from scipy import stats

def validate_feature_distributions(features_df, expected_ranges):
    """
    Validate that computed features fall within expected production ranges.
    Expected ranges are defined from production statistics or domain knowledge.
    """
    print("=" * 70)
    print("FEATURE DISTRIBUTION VALIDATION")
    print("=" * 70)
    print(f"{'Feature':<35} {'P5':<12} {'Median':<12} {'P95':<12} {'Status'}")
    print("-" * 70)
    failures = []
    for feature, (exp_min, exp_max) in expected_ranges.items():
        if feature not in features_df.columns:
            continue
        vals = features_df[feature].dropna()
        p5 = vals.quantile(0.05)
        median = vals.median()
        p95 = vals.quantile(0.95)
        # Check if the median falls within the expected range
        passed = exp_min <= median <= exp_max
        status = "✓ PASS" if passed else "✗ FAIL"
        if not passed:
            failures.append(feature)
        print(f"{feature:<35} {p5:<12.1f} {median:<12.1f} {p95:<12.1f} {status}")
    print("=" * 70)
    if not failures:
        print("✓ All features within expected ranges.")
    else:
        print(f"✗ Features outside expected range: {', '.join(failures)}")
        print("  Review synthetic database generation parameters.")
    print("=" * 70)
    return len(failures) == 0

# Define expected ranges from production statistics
# Format: feature_name: (expected_median_min, expected_median_max)
expected = {
    'num_accounts': (1, 3),
    'total_transactions': (5, 30),
    'avg_transaction_amount': (200, 2000),
    'account_age_days': (180, 1500),
    'fraud_rate': (0.0, 0.02)
}

validate_feature_distributions(features_df, expected)
Output:
text
======================================================================
FEATURE DISTRIBUTION VALIDATION
Feature P5 Median P95 Status
num_accounts 1.0 2.0 4.0 ✓ PASS
total_transactions 2.0 17.0 58.0 ✓ PASS
avg_transaction_amount 98.4 721.3 4820.1 ✓ PASS
account_age_days 45.0 892.0 2198.0 ✓ PASS
fraud_rate 0.0 0.0 0.025 ✓ PASS
✓ All features within expected ranges.
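Checking the median against a range is a coarse filter. If even a small sample of real feature values is available, a two-sample Kolmogorov-Smirnov test compares the full distributions. A minimal sketch; production_features_df is a hypothetical dataframe of real feature values, not something generated in this article:
python
from scipy import stats

def compare_distributions(synthetic_df, production_df, feature_cols, alpha=0.01):
    """Per-feature two-sample KS test between synthetic and production values."""
    for col in feature_cols:
        synth = synthetic_df[col].dropna()
        prod = production_df[col].dropna()
        stat, p_value = stats.ks_2samp(synth, prod)
        # A low p-value means the distributions are measurably different and the
        # generator parameters probably need revisiting
        flag = 'OK' if p_value >= alpha else 'DIVERGENT'
        print(f"{col:<28} KS={stat:.3f}  p={p_value:.4f}  {flag}")

# Example call once a production sample exists:
# compare_distributions(features_df, production_features_df,
#                       ['num_accounts', 'total_transactions', 'account_age_days'])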
Step 4: Test the Train-on-Synthetic, Test-on-Real Pattern
The final validation is the most important: does a model trained on synthetic-derived features actually generalize? This is the Train-Synthetic-Test-Real (TSTR) pattern. Until real data is available, a held-out synthetic split stands in for the real side.
python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

def run_tstr_evaluation(synthetic_features, target_col, feature_cols):
    """
    Train-Synthetic-Test-Real evaluation.
    Splits synthetic data: trains on 70%, tests on 30% as a proxy for real data.
    Reports AUC and F1 score to assess model viability before real data is available.
    """
    df = synthetic_features[feature_cols + [target_col]].dropna()
    # Encode categorical features
    le = LabelEncoder()
    if 'segment' in df.columns:
        df['segment'] = le.fit_transform(df['segment'].astype(str))
    X = df[feature_cols].select_dtypes(include=[np.number])
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    model = GradientBoostingClassifier(
        n_estimators=100,
        max_depth=4,
        random_state=42
    )
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    pred = model.predict(X_test)
    auc = roc_auc_score(y_test, proba)
    f1 = f1_score(y_test, pred)
    print("=" * 60)
    print("TRAIN-SYNTHETIC-TEST-REAL (TSTR) EVALUATION")
    print("=" * 60)
    print(f"Training samples: {len(X_train):,}")
    print(f"Test samples: {len(X_test):,}")
    print(f"AUC Score: {auc:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print("Feature importance (top 5):")
    importance_df = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False).head(5)
    for _, row in importance_df.iterrows():
        print(f"  {row['feature']:<35} {row['importance']:.4f}")
    print("=" * 60)
    return auc, f1, importance_df

# Prepare binary target: high-fraud customers
features_df['is_high_risk'] = (features_df['fraud_rate'] > 0.01).astype(int)

feature_cols = [
    'num_accounts',
    'total_balance',
    'total_transactions',
    'avg_transaction_amount',
    'spend_last_90_days',
    'txn_count_last_90_days',
    'account_age_days',
    'fraud_rate'
]

auc, f1, importance = run_tstr_evaluation(features_df, 'is_high_risk', feature_cols)
Output:
text
============================================================
TRAIN-SYNTHETIC-TEST-REAL (TSTR) EVALUATION
============================================================
Training samples: 1,330
Test samples: 571
AUC Score: 0.8923
F1 Score: 0.7614
Feature importance (top 5):
fraud_rate 0.4812
account_age_days 0.1923
avg_transaction_amount 0.1204
txn_count_last_90_days 0.0891
num_accounts 0.0652
An AUC of 0.89 on the synthetic holdout is a strong signal that the feature pipeline is working correctly. If this number were around 0.52, it would indicate that the features are not discriminative: either the synthetic data has broken relationships or the feature engineering logic is wrong.
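A cheap sanity check on that number is to score a dummy baseline on the same split; if the gradient-boosting model does not clearly beat it, the features carry no signal. A minimal sketch, recreating the split with the same random_state used inside run_tstr_evaluation and reusing features_df and feature_cols from above:
python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Rebuild the same train/test split used above (same random_state and stratification)
df = features_df[feature_cols + ['is_high_risk']].dropna()
X = df[feature_cols].select_dtypes(include=[np.number])
y = df['is_high_risk']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

baseline = DummyClassifier(strategy='prior').fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"Dummy baseline AUC: {baseline_auc:.4f}")  # should sit near 0.5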
The Complete Testing Checklist
Before your ML pipeline touches real production data, verify each of the following against your synthetic database (a sketch for automating several of these checks follows the list):
- Variable cardinality per parent table (not a fixed constant per customer or account)
- Account age spans more than the longest rolling window in your feature set
- NULL values present in nullable columns at realistic frequencies
- LEFT JOIN results contain expected null rows (customers with no accounts, accounts with no transactions)
- Rolling window features produce different values for different account tenures
- Feature distributions fall within expected production ranges
- TSTR evaluation achieves AUC above 0.75 (well above random, but not suspiciously close to 1.0)
- Feature importance rankings match domain expectations
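Several of these items can be asserted directly against the synthetic tables and the feature matrix before any training happens. A minimal sketch; the thresholds are illustrative, and the variable names assume the dataframes built earlier in this article:
python
def run_checklist(customers_df, accounts_df, transactions_df, features_df,
                  longest_window_days=90):
    """Automate a few of the checklist items. Thresholds are illustrative."""
    # Variable cardinality, not a constant per parent
    txn_per_account = transactions_df.groupby('account_id').size()
    assert txn_per_account.std() > 0, "transaction cardinality is constant"

    # Account age must exceed the longest rolling window
    assert features_df['account_age_days'].max() > longest_window_days, \
        "no account older than the longest rolling window"

    # LEFT JOIN nulls must actually occur (customers with no accounts)
    assert features_df['account_age_days'].isna().any(), \
        "every customer has an account; null propagation is untested"

    # Accounts with zero transactions must exist
    accounts_with_txn = set(transactions_df['account_id'])
    assert not set(accounts_df['account_id']).issubset(accounts_with_txn), \
        "every account has transactions; the zero-transaction case is untested"

    print("Checklist assertions passed.")

run_checklist(customers_df, accounts_df, transactions_df, features_df)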
Where This Approach Has Limits
This methodology is not a substitute for testing against real data. It is a substitute for the alternative: discovering bugs in production.
Two specific limits are worth noting.
Distribution shift: Your synthetic database is generated from assumptions about what production looks like. If those assumptions are wrong, your feature tests pass but your production model still fails. This is why the validation queries that check cardinality, account age, and null rates are as important as the feature tests themselves. Garbage in, garbage out applies to synthetic databases too.
Schema evolution: Every time your production schema changes, your synthetic generator needs updating. If a new column is added to transactions and your feature pipeline starts using it, your synthetic generator must reflect it. Teams that don’t maintain this discipline end up with a synthetic database that is three months behind production, which defeats the purpose.
Both limits are manageable. Neither makes the approach less valuable than the alternative, which is deploying untested feature pipelines to production and hoping they behave correctly.
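One way to keep the schema-evolution limit in check is a lightweight guard that fails fast when the synthetic tables no longer contain the columns the feature SQL expects. A minimal sketch; the expected_columns dict is an assumption you maintain alongside the production schema, and transactions_df comes from the generator above:
python
import sqlite3

expected_columns = {
    'transactions': {'transaction_id', 'account_id', 'customer_id',
                     'transaction_date', 'amount', 'transaction_type', 'is_fraud'},
}

conn = sqlite3.connect(':memory:')
transactions_df.to_sql('transactions', conn, index=False)
for table, expected in expected_columns.items():
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column
    actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    missing = expected - actual
    assert not missing, f"{table} is missing columns: {missing}"
conn.close()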
The Bottom Line
Your feature engineering code is more likely to break your production model than your model code is. Yet most teams test their models extensively and their feature pipelines hardly at all.
A synthetic relational database with realistic cardinality, full account age coverage, deliberate null injection, and referential integrity gives you a test environment that catches join bugs, null propagation, rolling window edge cases, and temporal inconsistencies before production ever sees your code.
Build the test database first. Run your actual pipeline SQL against it. Validate the features. Then train. Anything less is optimism dressed up as engineering.