Why Feature Engineering Still Beats Fancy Models (Part 2)

Part 2 of a 4-part series: From Data to Decisions

Why Most Models Are Decided Before Training Begins

Most machine learning projects do not fail because the model was poorly chosen. They fail because the features were easy to compute rather than meaningful to the decision being made.

By the time a model is trained, many outcomes are already constrained. The shape of the data has been accepted. The assumptions embedded in features have hardened. The system has quietly decided what it will and will not be able to learn.

This is uncomfortable for teams that enjoy experimenting with algorithms, because it shifts responsibility away from model selection and back toward design discipline. It suggests that progress is not unlocked by switching from one algorithm to another, but by making better choices earlier.

In the first part of this series, we focused on data understanding and exploratory analysis. Not as a reporting exercise, but as a way to surface operational realities, biases, and constraints before they became embedded in models.

This part picks up from there.

Feature engineering is where insights from EDA are either respected or ignored. It is the point at which understanding turns into structure. And in regulated banking systems, this is where most projects quietly succeed or fail.

Key Learnings Carried Forward from EDA

EDA does not tell us how to build the perfect model. What it does do is narrow the space of safe decisions. By the time exploratory analysis is complete, certain paths should already be ruled out.

In fraud systems, EDA often reveals that some features behave predictively only under very specific conditions. A variable may separate fraud from non-fraud overall, but collapse when segmented by channel, geography, or time. That fragility does not disappear during training. It becomes a source of instability in production.

EDA also exposes where labels reflect process rather than behavior. Delayed chargebacks, conservative alert thresholds, or investigation backlogs shape what the model sees as truth. Features that correlate strongly with these artifacts may perform well offline and fail the moment operating conditions change.

Imbalance and latency surface early as well. When fraud rates are extremely low and labels arrive weeks later, evaluation becomes constrained. Some metrics become misleading. Some validation strategies become unsafe. These are not modeling problems. They are design constraints.
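To make the metric point concrete, here is a toy illustration (synthetic numbers, not from any real portfolio): under extreme imbalance, a classifier that never flags fraud still scores near-perfect accuracy while catching nothing.

```python
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, extreme-imbalance labels: 1 fraud in 1,000 transactions.
y_true = [0] * 999 + [1]
y_pred = [0] * 1000  # a "model" that never flags anything

acc = accuracy_score(y_true, y_pred)  # looks excellent
rec = recall_score(y_true, y_pred)    # catches no fraud at all
print(acc, rec)
```

This is why accuracy is effectively unusable at low fraud rates, and why recall, precision, and cost-weighted metrics enter the picture early.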

Perhaps most importantly, EDA forces uncomfortable questions about assumptions. Are we capturing customer behavior, or system behavior? Are we modeling risk, or past policy decisions? Are missing values accidental, or informative?

Every one of these insights should influence feature choices.

When teams ignore them, feature engineering becomes mechanical. When teams respect them, feature engineering becomes intentional.

Feature Engineering as Decision Design

Feature engineering is often taught as a collection of techniques. Scaling, encoding, aggregation, transformation. In practice, it is closer to decision design.

Every feature answers a question the model is allowed to ask.
Every transformation decides how much context the model can see.
Every aggregation encodes a belief about what matters over time.

In banking systems, these decisions carry consequences far beyond model performance. Features influence explainability, auditability, latency, and operational trust. A feature that improves AUC but cannot be explained under review is not an improvement. A feature that relies on delayed data may look powerful during training and fail silently in real time.

This is why experienced teams think about features in terms of what decision they support, not how easily they can be computed.

A raw transaction amount is rarely meaningful on its own. Its significance depends on customer history, recent velocity, channel norms, and context. Encoding it without that context invites misinterpretation. Aggregating behavior over carefully chosen windows often tells a more stable story than any single event.

Similarly, categorical features are not just values to encode. They reflect how the system categorizes the world. Merchant categories, channel codes, and customer segments are shaped by upstream logic, not natural laws. Encoding them blindly often teaches the model to learn system design rather than risk.

Feature engineering done well narrows uncertainty. Done poorly, it amplifies noise.

This is why, in many production systems, simpler models trained on well-designed features outperform complex models built on convenience features. The difference is not algorithmic sophistication. It is respect for what the data actually represents.

In the sections that follow, we will move into specific feature choices, validation strategies, and early modeling decisions. But the outcome of those steps is already influenced by what we decide here.

Transformations and Scaling Without Breaking Meaning

Transformations are often treated as a technical necessity. Scale the data, normalize distributions, move on. In real banking systems, transformations are rarely neutral. They change how a model interprets risk, magnitude, and deviation.

Consider transaction amount, one of the most common features in fraud and payments data. Raw amounts are heavily skewed, and models that assume linearity struggle with them. The instinctive response is to apply a log transform.

import numpy as np

df["log_amount"] = np.log1p(df["transaction_amount"])

Technically, this works. Statistically, it stabilizes variance. Operationally, it introduces a decision.

A log-transformed amount compresses the difference between high-value transactions. That may improve model stability, but it also reduces sensitivity exactly where fraud losses are highest. In some systems, this is acceptable. In others, it is not.
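A quick check with illustrative amounts shows how much the high end is compressed:

```python
import numpy as np

# A $10,000 transaction is 100x a $100 one in raw dollars...
raw_ratio = 10_000 / 100

# ...but after log1p the gap shrinks to roughly 2x.
log_ratio = np.log1p(10_000) / np.log1p(100)
print(round(log_ratio, 2))
```

Whether a model should treat those two amounts as "roughly twice as different" or "a hundred times as different" is a risk decision, not a preprocessing detail.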

The point is not whether to transform, but why.

Scaling decisions have similar implications. Standard scaling assumes stable distributions. In banking data, distributions drift. Seasonal effects, policy changes, and market behavior alter baselines.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df["scaled_amount"] = scaler.fit_transform(df[["transaction_amount"]])

This introduces another assumption: that the past is a reliable reference for the future. In long-running systems, that assumption weakens over time.

For this reason, many production systems avoid aggressive global scaling and prefer relative or contextual features. Ratios, deltas, or customer-normalized values often preserve meaning better than absolute normalization.

df["amount_vs_customer_avg"] = (
    df["transaction_amount"] / df["customer_avg_amount"]
)

This transformation encodes behavior rather than magnitude. It asks whether a transaction is unusual for this customer, not whether it is large in absolute terms. That distinction matters more to decisions than mathematical elegance.

Transformations should clarify intent, not obscure it. If a transformation makes a feature harder to explain, audit, or reason about, its performance gain rarely survives contact with production.

Categorical Encoding and the Illusion of Signal

Categorical variables are where many models quietly overfit.

Merchant category codes, channels, countries, device types, customer segments. These fields look descriptive, but they often reflect system design choices, not natural categories. Encoding them without context teaches models to learn those design decisions rather than underlying behavior.

One-hot encoding is usually the first approach.

mcc_dummies = pd.get_dummies(df["merchant_category"], prefix="mcc")
df = df.join(mcc_dummies)

For low-cardinality features, this can work. In real banking data, cardinality is rarely low. Merchant categories evolve. New values appear. Old ones disappear. Sparse representations grow unstable.

More importantly, one-hot encoding assumes that categories are independent. In practice, many are proxies for similar risk profiles shaped by routing, geography, or regulation.

Target encoding appears attractive as an alternative.

target_means = df.groupby("merchant_category")["is_fraud"].mean()
df["merchant_risk"] = df["merchant_category"].map(target_means)

This often boosts offline performance dramatically. It also introduces one of the most dangerous forms of leakage if not handled carefully. When labels are delayed or biased, target encodings amplify historical decisions rather than current risk.

In regulated environments, this creates two problems. First, the feature becomes difficult to explain without deep statistical context. Second, it bakes past policy behavior directly into the model.

Experienced teams treat categorical encoding as a risk surface, not a preprocessing step.

Common safeguards include:

  • Encoding based on historical windows only
  • Grouping rare categories into stable buckets
  • Creating domain-informed mappings rather than purely data-driven ones
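The first safeguard can be sketched as a past-only, expanding target encoding. This is a minimal sketch on toy data (column names mirror the article's; the fallback prior is a deliberate simplification):

```python
import pandas as pd

# Toy labeled events (synthetic data for illustration only).
df = pd.DataFrame({
    "event_time": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]
    ),
    "merchant_category": ["A", "A", "A", "B"],
    "is_fraud": [1, 0, 1, 0],
}).sort_values("event_time")

# For each row, use only fraud observed strictly before that event,
# so the encoding never sees its own label (no target leakage).
df["merchant_risk_past"] = (
    df.groupby("merchant_category")["is_fraud"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

# First occurrence of a category has no history; fall back to a prior.
# Using the in-sample global rate here is itself a shortcut: in
# production the prior should come from an earlier period.
df["merchant_risk_past"] = df["merchant_risk_past"].fillna(df["is_fraud"].mean())
```

Even this careful version inherits the label-delay problem: if chargebacks arrive weeks late, the "past" fraud rate is itself incomplete.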

For example, grouping merchant categories by business type rather than code granularity.

df["merchant_group"] = df["merchant_category"].map(mcc_to_group_mapping)

This sacrifices some granularity but gains stability and explainability. In many production systems, that trade-off is deliberate.

The most important question to ask with categorical features is not how to encode them, but what they truly represent. If a category reflects how a system labels activity rather than how risk manifests, encoding it blindly creates fragile models.

This is why categorical features often look powerful early and disappoint later. The illusion of signal fades when the system around them changes.

Aggregations and Domain Features: Where Experience Shows

If feature engineering has a place where experience outweighs technique, this is it.

Raw transactional features describe events. Aggregations describe behavior. And in most banking systems, behavior generalizes far better than individual events.

A single transaction amount tells us very little in isolation. The same amount can be routine for one customer and anomalous for another. What matters is how that transaction compares to recent history, typical patterns, and contextual baselines.

This is why aggregation windows matter more than models.

Consider a simple rolling aggregation.

df = df.sort_values(["customer_id", "event_time"])

df["txn_count_24h"] = (
    df.groupby("customer_id")
    .rolling("24h", on="event_time")["transaction_id"]
    .count()
    .reset_index(level=0, drop=True)
)

Note the order of operations: the time-based window needs the event_time column, so the column selection comes after .rolling(on="event_time"), not before.

This feature does not describe a transaction. It describes pace. It captures behavioral acceleration, which is often more predictive than magnitude.

Domain-driven aggregates go further.

In fraud systems, velocity across channels matters. In credit systems, payment consistency matters. In trading or forex data, volatility over recent windows often matters more than price level.

df["amount_sum_7d"] = (
    df.groupby("customer_id")
    .rolling("7d", on="event_time")["transaction_amount"]
    .sum()
    .reset_index(level=0, drop=True)
)

The power of these features is not statistical. It is conceptual. They encode how humans reason about risk, translated into machine-readable form.

Time windows are design choices, not defaults. A 1-hour window captures bursts. A 7-day window captures habits. A 30-day window captures norms. Choosing between them is a business decision disguised as feature engineering.

Domain features often outperform generic ones because they encode institutional knowledge. A feature like “number of international transactions in last 48 hours” exists because someone understands customer behavior, not because an algorithm discovered it.

These features are also easier to explain. When auditors or business stakeholders ask why a transaction was flagged, aggregated behavior is easier to defend than abstract embeddings or opaque encodings.

This is why many production systems rely on relatively simple models paired with carefully designed aggregates. The model becomes a decision combiner, not a pattern miner.

In practice, this is where feature engineering stops being a technical exercise and becomes system design.

Train–Validation–Test Strategy: A Business Decision in Disguise

Data splitting is often treated as a mechanical step. In real systems, it is one of the most consequential design choices.

Random splits assume independence. Banking data violates that assumption almost everywhere. Customer behavior evolves. Policies change. Fraud adapts. Markets shift.

A random split may look statistically sound and still be operationally meaningless.

In time-dependent systems, the safest default is temporal separation.

train = df[df["event_time"] < "2023-01-01"]
valid = df[(df["event_time"] >= "2023-01-01") & (df["event_time"] < "2023-03-01")]
test = df[df["event_time"] >= "2023-03-01"]

This split answers a real question: Can this model trained on the past perform on the future?

It also exposes uncomfortable truths early. Performance often drops compared to random splits. That drop is not a failure. It is honesty.

Validation strategy is also tied to label availability. In fraud systems, labels lag events. Recent data may not be fully labeled. Including it in validation can silently bias results.

Experienced teams accept smaller, cleaner validation sets over larger, misleading ones.
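A common guard is to cap the validation window at a label-maturity cutoff, so only events old enough to be reliably labeled are evaluated. A minimal sketch (the 30-day maturity below is a hypothetical figure; the right lag depends on the chargeback process):

```python
import pandas as pd

# Toy candidate set; in a real system this is the full validation slice.
df = pd.DataFrame({
    "event_time": pd.to_datetime(["2023-02-01", "2023-02-20", "2023-03-10"]),
    "is_fraud": [0, 1, 0],
})

as_of = pd.Timestamp("2023-03-15")       # when evaluation is run
label_maturity = pd.Timedelta(days=30)   # hypothetical chargeback lag

# Keep only events whose labels have had time to arrive.
valid = df[df["event_time"] <= as_of - label_maturity]
```

Here only the February 1 event survives; the later two look labeled but may still flip once late chargebacks land.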

This section connects directly to the next one. Once the split reflects reality, the first model we build must serve a specific purpose.

Not performance. Perspective.

Baseline Models: The Most Honest Signal You Will Get

Baseline models are often described as starting points. In practice, they are reality checks.

A simple logistic regression or decision tree trained on well-understood features tells you something critical: how much signal is actually present.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

If a baseline performs surprisingly well, it suggests the problem is structurally learnable. If it performs poorly, no amount of model complexity will rescue weak features or flawed labels.

Baselines also expose leakage early. When a simple model performs too well, it is often because it learned something it should not have been allowed to know.

Equally important, baselines anchor expectations. They prevent teams from mistaking relative improvements for absolute progress.

In production systems, baselines are rarely discarded. They remain reference points. They act as fallbacks. They provide sanity checks when complex models drift.

This is why baselines deserve to be evaluated using the same validation logic as advanced models. Their role is not to win. It is to tell the truth.

Once this foundation is set, model selection becomes meaningful rather than exploratory.

And that is where Part 3 of this series will go next: how to evaluate models, choose thresholds, and design decisions that survive contact with the real world.

Closing Thoughts

Feature engineering is often discussed as a technical craft. In practice, it is a discipline of restraint.

The choices made here determine what a model is allowed to learn, how stable it will be under change, and whether its decisions can be explained when it matters most. Well-designed features reduce uncertainty. Poorly chosen ones amplify noise, even when models appear strong on paper.

In this part of the series, we moved from understanding data to shaping it. Aggregations, encodings, validation strategies, and baselines were not treated as techniques, but as design decisions grounded in operational reality.

In the next part, the focus will shift from models to decisions. We will look at evaluation metrics, threshold selection, explainability, and why high accuracy often fails to translate into reliable outcomes in production systems.

If this perspective aligns with your experience, I would value hearing how you approach these trade-offs in real projects. Feel free to share your thoughts in the comments.

If you found this useful, consider liking or sharing it with others working on real-world ML systems. And if you’d like to follow along as the series continues, you’re welcome to follow my work here. Thanks!

Real systems improve through shared experience and honest discussion.


Why Feature Engineering Still Beats Fancy Models (Part 2) was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
