The Dirty Truth About Financial Data Nobody Talks About Before Building AI Models

I spent three weeks cleaning payment data before writing a single line of model code. Here’s what I learned.

Everyone wants to talk about the fancy part, the model architecture, the accuracy scores, the deployment pipeline. Nobody wants to talk about the 72 hours you’ll spend staring at a CSV file wondering why 40% of your transaction timestamps are in four different time zones with no documentation explaining why.

That’s the real job. And if you’re building AI-driven applications in finance or payments, getting this part wrong doesn’t just hurt your model performance; it can cost real money, trigger compliance issues, or worse, make your system confidently wrong in ways nobody catches until it’s too late.

Let me walk you through what actually works.

Why Financial Data Is a Special Kind of Nightmare

Financial data isn’t like other data. It carries regulatory, operational, and reputational weight. A missing value in a product recommendation dataset means someone gets a slightly worse movie suggestion. A missing value in a fraud detection pipeline means a transaction either gets wrongly blocked or a fraudulent charge slips through.

The stakes change how you have to think about every single step.

Here’s what makes payments and finance data uniquely painful to work with:

It’s fragmented by design. Payment data lives across acquiring banks, payment processors, card networks, internal ledgers, and third-party risk vendors. Each source has its own schema, its own quirks, and its own definition of what “settled” means.

It’s temporally sensitive. A transaction that happened at 11:58 PM versus 12:02 AM can mean completely different things from a fraud, tax, or reporting perspective. Timezone handling isn’t a minor detail; it’s a core data integrity issue.

It’s heavily regulated. PCI-DSS, GDPR, SOX, AML requirements aren’t just compliance checkboxes. They directly constrain what data you can store, how you can transform it, and who can access it during your preprocessing pipeline.

It’s imbalanced almost by definition. Fraud rates in most payment systems sit between 0.1% and 2%. If you don’t handle this during preprocessing, your model will learn to predict “not fraud” for everything and still hit 98% accuracy. Congratulations, you’ve built a useless model.

Step 1: Data Extraction & Getting the Raw Material Right

Before you can clean anything, you have to pull it. And how you pull it matters more than most people realize.

Know Your Sources Before You Touch Them

Map every data source before writing a single extraction query. In a typical payments environment, you’re looking at:

  • Transaction logs from payment processors (Stripe, Adyen, and Braintree each export differently)
  • Chargeback and dispute data from card networks
  • KYC/AML screening results from compliance vendors
  • Internal ledger data from core banking or accounting systems
  • Behavioral data like login events, session data, device fingerprints

Each of these has different latency, different update frequencies, and different reliability characteristics. Your fraud model needs near-real-time transaction data. Your credit risk model might be fine with daily batch pulls. Conflating these is how you introduce subtle data leakage that destroys your model’s real-world performance.

Point-in-Time Correctness Is Non-Negotiable

This is the one that bites people hardest. In finance, data gets restated. Transactions get reversed. Credit scores get updated. If you’re building a training dataset and you pull the “current” state of a customer’s credit profile, you’re including information that didn’t exist at the time the decision was made.

That’s data leakage. Your model will look incredible in backtesting and fail in production.

The fix: always extract data as it existed at the point in time you’re modeling. This means maintaining historical snapshots or using event-sourced data architectures. It’s more infrastructure work upfront, but it’s the only way to build models that actually generalize.
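One way to get point-in-time correctness from a snapshot history is an as-of join. A minimal sketch with pandas, assuming hypothetical table and column names: each transaction is joined to the latest credit score snapshot recorded at or before the transaction time, never a later restatement.

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "txn_time": pd.to_datetime(["2024-01-05", "2024-03-10", "2024-02-01"]),
    "amount": [120.0, 80.0, 40.0],
}).sort_values("txn_time")

# Each row is the score *as recorded* at snapshot_time (event-sourced history).
score_history = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "snapshot_time": pd.to_datetime(["2024-01-01", "2024-02-15", "2024-01-20"]),
    "credit_score": [640, 700, 710],
}).sort_values("snapshot_time")

# merge_asof picks the latest snapshot at or before each transaction,
# per customer: the point-in-time view, not the current state.
pit = pd.merge_asof(
    transactions, score_history,
    left_on="txn_time", right_on="snapshot_time",
    by="customer_id", direction="backward",
)
print(pit[["customer_id", "txn_time", "credit_score"]])
```

Customer 1’s March transaction picks up the February snapshot (700), while their January transaction sees only the older score (640), which is exactly the behavior a leakage-free training set needs.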

Practical Extraction Checklist

Before you move to preprocessing, make sure you can answer:

  • What is the exact timestamp format and timezone for each source?
  • What does a “null” value mean in this context? Is it missing, not applicable, or unknown?
  • Are there known data quality issues or outages in the historical record?
  • What’s the refresh cadence, and does it match your modeling requirements?
  • What PII fields exist, and what’s your handling protocol?
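The first checklist item is the one worth automating immediately. A minimal sketch of normalizing per-source timestamps to UTC at extraction time; the source names and timezone map here are assumptions standing in for whatever your source documentation says.

```python
import pandas as pd

raw = pd.DataFrame({
    "source": ["processor_a", "processor_b"],
    "ts": ["2024-03-01 23:58:00", "2024-03-02 00:02:00"],
})

# Assumed per-source timezone map; in practice this comes from source docs,
# not guesswork.
source_tz = {"processor_a": "America/New_York", "processor_b": "UTC"}

def to_utc(row):
    # Localize in the source's declared timezone, then convert to UTC.
    return (pd.Timestamp(row["ts"])
            .tz_localize(source_tz[row["source"]])
            .tz_convert("UTC"))

raw["ts_utc"] = raw.apply(to_utc, axis=1)
```

Note how the two raw timestamps look four minutes apart but are nearly five hours apart in UTC, which is exactly the 11:58 PM versus 12:02 AM problem from earlier.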

Step 2: Preprocessing — Where the Real Work Happens

Raw financial data is not model-ready data. The gap between the two is where most projects quietly fail.

Handling Missing Values in Financial Contexts

Generic advice says “impute with mean or median.” Financial data requires more nuance.

A missing income field on a loan application is not the same as a missing income field on a transaction record. In the first case, it might indicate the applicant didn’t disclose, which is itself a signal. In the second, it might just be a system integration gap.

What actually works:

  • Flag missingness explicitly. Create a binary indicator column for fields with significant missing rates. The fact that a value is missing is often predictive.
  • Use domain-informed imputation. For merchant category codes, missing values often cluster around specific acquirers or integration types. Imputing based on similar merchants in the same category is more defensible than global median imputation.
  • Don’t impute target-adjacent fields. If you’re predicting default risk and “days past due” has missing values, imputing that field is dangerous. Understand why it’s missing first.
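The first two points combine into a flag-then-impute pattern. A minimal sketch, with made-up column names: keep an explicit missingness indicator, then impute from similar records (here, the median within the same merchant category) rather than a global median.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "merchant_category": ["grocery", "grocery", "travel", "travel"],
    "amount": [25.0, np.nan, 900.0, 1100.0],
})

# 1) Preserve the fact of missingness as its own (often predictive) feature.
df["amount_missing"] = df["amount"].isna().astype(int)

# 2) Domain-informed imputation: median within the category, not globally.
#    A global median here (~900) would be wildly wrong for a grocery row.
df["amount"] = df.groupby("merchant_category")["amount"].transform(
    lambda s: s.fillna(s.median())
)
```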

Feature Engineering That Actually Matters

This is where you earn your money. Raw transaction fields (amount, merchant, timestamp) are rarely what your model needs. What it needs is context.

Velocity features are the backbone of fraud detection. How many transactions has this card made in the last hour? Last 24 hours? How does today’s transaction amount compare to the 30-day average for this customer? These features require careful time-windowed aggregations, and they need to be computed without leaking future information.
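A minimal sketch of a leakage-safe velocity feature, assuming hypothetical column names: for each transaction, count that card’s transactions in the trailing 24 hours. A trailing time window only ever looks backward, so no future rows leak in.

```python
import pandas as pd

txns = pd.DataFrame({
    "card_id": ["c1", "c1", "c1", "c2"],
    "ts": pd.to_datetime([
        "2024-05-01 09:00", "2024-05-01 10:30",
        "2024-05-02 11:00", "2024-05-01 09:15",
    ]),
    "amount": [20.0, 35.0, 500.0, 60.0],
}).sort_values(["card_id", "ts"]).set_index("ts")

# Trailing 24h count per card, including the current transaction.
# The third c1 transaction is more than 24h after the first two,
# so its count drops back to 1.
txns["txn_count_24h"] = (
    txns.groupby("card_id")["amount"]
        .rolling("24h")
        .count()
        .reset_index(level=0, drop=True)
)
```

The same pattern with `.mean()` or `.sum()` instead of `.count()` gives trailing-average and trailing-spend features; the key design choice is that the window is anchored at the current row and extends only into the past.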

Behavioral baselines. A $5,000 transaction from a customer who regularly makes $4,000-$6,000 purchases is very different from the same transaction from someone whose average is $50. Normalizing transaction amounts against customer-level baselines dramatically improves signal quality.

Merchant and network graph features. Payments don’t happen in isolation. The relationship between a cardholder, a merchant, an acquiring bank, and a card network carries information. Graph-based features, such as the number of unique cardholders a merchant has processed in the last week, can surface patterns that row-level features miss entirely.
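A minimal sketch of that merchant-side feature, again with assumed column names: distinct cardholders per merchant over a trailing 7-day window. Rolling `apply` needs numeric data, so card IDs are factorized to integer codes first.

```python
import numpy as np
import pandas as pd

txns = pd.DataFrame({
    "merchant_id": ["m1", "m1", "m1", "m2"],
    "card_id": ["c1", "c2", "c1", "c9"],
    "ts": pd.to_datetime(["2024-06-01", "2024-06-03", "2024-06-20", "2024-06-02"]),
}).sort_values(["merchant_id", "ts"]).set_index("ts")

# rolling(...).apply works on numerics, so map card ids to integer codes.
txns["card_code"] = pd.factorize(txns["card_id"])[0]

# Distinct cards seen by this merchant in the trailing 7 days.
txns["uniq_cards_7d"] = (
    txns.groupby("merchant_id")["card_code"]
        .rolling("7D")
        .apply(lambda a: len(np.unique(a)), raw=True)
        .reset_index(level=0, drop=True)
)
```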

Time-based features. Day of week, hour of day, days since last transaction, and days until statement close all carry predictive signal in financial contexts. Encode cyclical time features (hour, day of week) using sine/cosine transformations rather than raw integers. Your model will thank you.
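The sine/cosine trick in a few lines: mapping hour-of-day onto a circle puts 23:00 and 00:00 close together, which a raw integer encoding (where they are 23 apart) fails to capture.

```python
import numpy as np

hours = np.array([0, 6, 12, 23])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
# In (sin, cos) space, hour 23 is now a near neighbor of hour 0,
# while hour 12 sits on the opposite side of the circle.
```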

Dealing With Class Imbalance

I mentioned this earlier, but it deserves its own section because it’s so commonly mishandled.

For fraud detection with a 0.5% fraud rate, you have a few options:

Oversampling (SMOTE and variants): Generates synthetic minority class samples. Works reasonably well for tabular financial data, but be careful: synthetic fraud samples that don’t reflect real fraud patterns can actually hurt model performance. Apply oversampling only to your training set, never to validation or test sets.

Undersampling: Randomly removes majority class samples. Faster and simpler, but you’re throwing away real data. Use with caution on smaller datasets.

Cost-sensitive learning: Assign higher misclassification costs to the minority class directly in your model’s loss function. This is often the cleanest approach because it doesn’t alter the data distribution; it just tells the model that missing a fraud is more expensive than a false positive.

Threshold calibration: Don’t just use 0.5 as your classification threshold. In fraud detection, you’re almost always operating on a precision-recall tradeoff. Tune your threshold based on the actual business cost of false positives versus false negatives.
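A minimal sketch combining the last two options on synthetic imbalanced data: cost-sensitive class weights during training, then a threshold tuned against assumed business costs rather than a fixed 0.5. The 20:1 cost ratio is a placeholder for whatever your actual fraud-loss versus false-positive economics say.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with ~0.5% positives, mimicking a fraud rate.
X, y = make_classification(n_samples=20000, weights=[0.995, 0.005], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# class_weight="balanced" reweights the loss instead of resampling the data.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]

def expected_cost(t, fn_cost=20.0, fp_cost=1.0):
    # Assumed costs: a missed fraud is 20x worse than a false alarm.
    pred = proba >= t
    fn = np.sum((y_te == 1) & ~pred)
    fp = np.sum((y_te == 0) & pred)
    return fn * fn_cost + fp * fp_cost

# Pick the threshold that minimizes expected business cost on a grid.
thresholds = np.linspace(0.05, 0.95, 19)
best_t = min(thresholds, key=expected_cost)
```

The grid includes 0.5, so by construction the tuned threshold can only match or beat the default on this cost metric; in practice you would tune on a validation set, not the test set shown here.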

Normalization and Scaling

Transaction amounts span orders of magnitude. A $2 coffee and a $200,000 wire transfer can both be legitimate and both can be fraudulent. Log transformation of monetary amounts is standard practice and usually the right call. It compresses the range while preserving relative differences.

For neural network-based approaches, standard scaling (zero mean, unit variance) after log transformation works well. For tree-based models (XGBoost and LightGBM dominate production fraud systems for good reason), scaling matters less, but log transformation of skewed monetary features still helps.
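The transformation itself is one line. `log1p` (log of 1 + x) is the usual choice over plain `log` because it handles zero-value amounts gracefully while still compressing the range.

```python
import numpy as np

# $0 to $200k spans many orders of magnitude...
amounts = np.array([0.0, 2.0, 150.0, 200000.0])

# ...but log1p compresses that into roughly 0 to 12 while
# preserving order and relative differences.
log_amounts = np.log1p(amounts)
```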

Categorical Encoding in High-Cardinality Financial Data

Merchant IDs, BIN numbers, IP addresses, device fingerprints. Financial data is full of high-cardinality categorical variables. One-hot encoding a field with 50,000 unique merchants will destroy your model’s performance and your memory budget.

Target encoding (replacing categories with their mean target value) works well here but requires careful cross-validation to avoid leakage. Use out-of-fold encoding.
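A minimal sketch of out-of-fold target encoding, with a toy table and made-up column names: each row’s encoding is computed only from the other folds, so no row ever sees its own label, and categories unseen in a fold’s training split fall back to the global mean.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "merchant_id": ["m1", "m1", "m2", "m2", "m3", "m3"],
    "is_fraud":    [1,    0,    0,    0,    1,    1],
})

global_mean = df["is_fraud"].mean()
df["merchant_te"] = np.nan

for tr_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(df):
    # Per-merchant fraud rate, computed from the other folds only.
    fold_means = df.iloc[tr_idx].groupby("merchant_id")["is_fraud"].mean()
    # Unseen merchants in this fold fall back to the global mean.
    df.loc[df.index[val_idx], "merchant_te"] = (
        df.iloc[val_idx]["merchant_id"].map(fold_means).fillna(global_mean).values
    )
```

In production you would also add smoothing toward the global mean for rare categories; this sketch shows only the leakage-prevention mechanics.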

Embedding layers for neural network approaches can learn dense representations of high-cardinality categoricals. This is increasingly the approach for large-scale payment systems.

Frequency encoding replaces a category with how often it appears in the dataset. It’s a simple baseline that often performs surprisingly well.
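Frequency encoding is a two-liner in pandas, sketched here on a toy merchant column:

```python
import pandas as pd

merchants = pd.Series(["m1", "m1", "m2", "m1", "m3"])
# Each category is replaced by its relative frequency in the data.
merchant_freq = merchants.map(merchants.value_counts(normalize=True))
```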

The Compliance Layer You Can’t Skip

Everything above assumes you’ve already handled your data governance requirements. In practice, you need to build compliance into your preprocessing pipeline, not bolt it on afterward.

PII tokenization before feature engineering. Card numbers, account numbers, and customer identifiers should be tokenized or hashed before they enter your feature engineering pipeline. You want to preserve the ability to link records without storing raw PII in your training data.
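One common pattern for linkage-preserving tokenization is a keyed hash. A minimal sketch using Python’s standard library: HMAC-SHA256 of the identifier with a secret key maps the same input to the same token (so joins still work) without storing raw PII. The key below is a placeholder; in production it lives in a secrets manager, and a full PCI-DSS setup may require a vaulted tokenization service rather than hashing.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-managed-secret"  # placeholder, never hardcode this

def tokenize(pii_value: str) -> str:
    # Keyed hash: deterministic per key, so records stay linkable,
    # but the raw value is not recoverable without the key.
    return hmac.new(SECRET_KEY, pii_value.encode(), hashlib.sha256).hexdigest()

t1 = tokenize("4111111111111111")
t2 = tokenize("4111111111111111")
t3 = tokenize("4242424242424242")
```

The keyed construction matters: a plain unsalted hash of a 16-digit card number is trivially reversible by brute force over the card-number space.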

Audit trails. In regulated environments, you need to be able to explain what transformations were applied to data and why. Document your preprocessing steps as code, not just as comments. Use versioned pipelines.

Model explainability starts here. If you’re building credit decisioning models subject to adverse action notice requirements (ECOA, FCRA in the US), your preprocessing choices directly affect your ability to explain model outputs. Features that are transformations of protected characteristics can create disparate impact even if the protected characteristic itself isn’t in the model.

What Good Looks Like

After all of this, a well-prepared financial dataset for an AI application should have:

  • Clean, consistent timestamps with explicit timezone handling
  • No future information leaking into historical training windows
  • Explicit missingness indicators alongside imputed values
  • Engineered velocity, behavioral, and contextual features
  • Properly handled class imbalance (in training data only)
  • Tokenized PII with preserved linkage keys
  • A documented, versioned, reproducible preprocessing pipeline

That last point matters more than people realize. The model you train today will need to be retrained in six months. The preprocessing pipeline needs to run on new data without someone reverse-engineering what past-you was thinking.

The Honest Bottom Line

The difference between a financial AI application that works in production and one that looks great in a notebook demo is almost always in the data preparation. The model architecture matters. The feature selection matters. But none of it matters if you’re feeding the model data that doesn’t accurately represent the problem you’re trying to solve.

Spend the time here. It’s not glamorous. It won’t make for impressive conference slides. But it’s the work that determines whether your system actually performs when real money is on the line.

And in payments and finance, real money is always on the line.

If you’re working through similar challenges in financial ML or have a different approach to any of these steps, I’d genuinely like to hear about it in the comments. The field moves fast and there’s no single right answer to most of these problems.

If this was useful, follow along. I write about applied ML in finance, the gap between research and production, and the unglamorous work that actually makes systems work.


The Dirty Truth About Financial Data Nobody Talks About Before Building AI Models was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
