Part 1 of a 4-part series: From Data to Decisions

Series Context
This is a four-part series built the way real machine learning work happens in banks, not the way it is shown in demos or notebooks.
- Part 1 focuses on data understanding and exploratory analysis. This is where most projects silently succeed or fail.
- Part 2 will turn those insights into features and models, and show why feature engineering still matters more than algorithms.
- Part 3 will move beyond accuracy and into decision logic, thresholds, and explainability, where most AI systems break in production.
- Part 4 will deal with deployment, monitoring, drift, and governance, the part that separates experiments from enterprise systems.
Each part stands on its own, but together they reflect how experienced teams actually build AI systems.
This article focuses entirely on Part 1.
When Banking Decisions Were Slow but Understandable
Early in my career, most banking decisions were not made by models. They were made by people supported by rules, reports, and experience built over years of exposure to edge cases.
Fraud analysts began their day with queues generated overnight. Credit officers reviewed applications with scorecards, bureau reports, and policy checklists laid out on their desks. AML investigators escalated alerts based on scenarios that had been debated, documented, and signed off months earlier. These systems were not fast, and they were rarely elegant, but they had one powerful characteristic that modern systems often lack: decisions were explainable by default.
If a transaction was declined, the reason was visible. If a loan was rejected, the exact policy clause could be pointed out. If an alert was raised, the rule that triggered it was known and documented. Accountability was built into the process, not added later.
The same pattern existed beyond core banking.
In equity trading desks, portfolio managers relied on market reports, historical price ranges, and macro signals rather than black-box strategies. A trader could explain why a position was taken based on volume patterns, earnings announcements, or sector movement. In forex operations, dealers watched spreads, liquidity windows, and economic calendars. Decisions were slower, sometimes manual, but they were grounded in intuition reinforced by data that humans could reason about.
These approaches worked because the scale was manageable.
As transaction volumes grew, market velocity increased, and digital access removed natural friction, these human-centered systems began to fracture. Card networks started processing millions of transactions per minute. Stock markets shifted from human traders to algorithmic execution. Forex prices moved continuously across time zones, reacting to global events in seconds rather than hours.
Static rules became brittle. Thresholds that once made sense started firing constantly. Fraud analysts were flooded with false positives. Trading systems overreacted to noise. Risk reports arrived after the market had already moved on.
The problem was not that the old systems were wrong. The problem was that they could no longer keep up.
Decisions that once took minutes or hours now needed to happen in milliseconds. Human review became a bottleneck. Heuristics that worked in stable environments collapsed under non-linear behavior and feedback loops.
This is where machine learning entered the picture.
Models promised speed, scale, and pattern recognition beyond human capacity. They could detect subtle correlations in transaction flows, price movements, or behavioral shifts that no rulebook could encode. In theory, this was the natural evolution.
But here is the part that is rarely discussed honestly.
Most machine learning projects did not fail because the models were weak. They failed because teams treated data as a technical input rather than an operational reality. Historical labels were assumed to be ground truth. Missing values were treated as noise. Market regime changes were ignored. Context was stripped away in the rush to train.
In banking, trading, and forex systems alike, teams moved too quickly from data extraction to modeling, skipping the uncomfortable work of understanding what the data actually represented.
And the cost of that shortcut almost always surfaced later.
Why Data Understanding Comes Before Everything Else
In regulated banking environments, data is never clean, never neutral, and never complete. Anyone who has worked closely with production systems knows this instinctively, even if it rarely makes it into project documentation.
Transaction data carries the scars of real operations. Retries, reversals, partial authorizations, offline approvals, and reconciliation delays all leave their mark. Customer data reflects years of mergers, migrations, and policy changes. Fields that look identical across systems often mean slightly different things. Labels frequently encode human judgment, operational constraints, and hindsight, rather than objective ground truth.
Time matters. Context matters. Sequence matters.
And missing values are almost never random.
A missing merchant category may indicate fallback routing. A missing device ID may reflect channel limitations rather than user behavior. A delayed fraud label may be the result of chargeback timelines, not detection failure. Treating these patterns as noise is one of the fastest ways to build a misleading model.
Before building any model, experienced teams spend serious time understanding what the data actually represents in day-to-day operations.
- Not what the column name suggests.
- Not what the schema documentation claims.
- But what the value means when a transaction is retried at 2 a.m., when a customer travels internationally, or when a system fails over to a backup processor.
This is where exploratory data analysis earns its place.
EDA is not about producing beautiful charts for presentations. It is about interrogating the data until it reveals its assumptions. It exposes where business logic leaked into labels, where operational processes shaped distributions, and where regulatory constraints quietly influenced outcomes.
In fraud systems, EDA often reveals that certain customer segments are overrepresented in alerts due to conservative thresholds rather than actual risk. In credit data, it shows how historical policy decisions influence default labels. In AML systems, it uncovers how scenario-driven alerts bias training data toward what was previously known, not what was necessarily risky.
Done properly, EDA surfaces risks, biases, constraints, and opportunities early, when they are still cheap to address.
Skipped or rushed, those same issues reappear later as unexplained model behavior, poor production performance, audit findings, or loss of business trust.
This is why teams that have built real systems treat data understanding not as a preliminary step, but as a form of risk management. It is the quiet work that prevents very public failures.
And it is also why, in most mature projects, the most important modeling decisions are already visible before the first algorithm is trained.
Problem Definition and Business Context
Let us anchor this discussion in a realistic, production-grade example.
Assume we are building a transaction-level fraud detection system for card payments operating in near real time. Every swipe, tap, or online purchase must be evaluated within milliseconds, often before the authorization response is sent back to the merchant.
The business objective in such a system is rarely to maximize model accuracy. Accuracy is a reporting metric, not a business outcome.
What actually matters is reducing confirmed fraud losses while minimizing customer friction, operational cost, and regulatory exposure.
- A false negative allows fraud to pass through and creates direct financial loss.
- A false positive blocks a legitimate customer, damages trust, and generates avoidable calls, complaints, and churn.
- Excessive alerts overwhelm analysts and increase operational fatigue.
- Latency violations can break payment flows entirely.
- Explainability is not optional, because every declined transaction must be defensible to regulators, partners, and customers.
These constraints exist simultaneously, and they often pull in different directions.
Our target variable in this scenario is whether a transaction was later confirmed as fraudulent. That confirmation usually arrives through chargebacks, customer disputes, or post-facto investigations. This means the label is delayed, incomplete, and influenced by human processes. Not all fraud is reported. Not all disputes are fraud. Some fraud is refunded silently. Others are discovered weeks later.
Already, this introduces label delay, survivorship bias, and operational noise into the dataset.
A transaction that appears legitimate today may be labeled as fraud weeks later. Another may never be labeled at all because the customer chose not to dispute it. Some labels reflect policy thresholds rather than true risk. These realities shape what the model can and cannot learn.
Understanding this context is not a formality. It directly influences how we explore the data, how we interpret distributions, how we treat missing values, and how we evaluate model performance later.
EDA, in this setting, is not an academic exercise. It is how we ensure that the model we eventually build reflects how the bank actually experiences fraud, not how clean datasets imagine it.
And this is why, in real fraud systems, the most important design decisions are already being made at the problem definition stage, long before the first feature is engineered or the first model is trained.
Data Sources and Schema Overview
A realistic fraud dataset in a banking environment is rarely a single, neatly curated table. It is usually an aggregation of signals pulled from multiple operational systems, stitched together over time.
At a high level, such a dataset typically includes several broad categories of information.
- Transaction-level attributes capture what happened at the moment of payment. This includes the transaction amount, currency, merchant category, channel type such as POS, e-commerce, or in-app, country codes, and response indicators from upstream networks. These fields often look straightforward, but they frequently hide operational complexity such as retries, reversals, partial approvals, and fallback routing.
- Customer attributes provide historical and contextual grounding. Tenure with the bank, account age, geography, prior transaction behavior, spending patterns, and previous fraud exposure all fall into this category. These features are rarely static. They evolve over time, sometimes asynchronously, depending on upstream refresh cycles and data availability.
- Device and location signals add another layer of context. Device fingerprints, IP addresses, velocity indicators, and geolocation estimates can be powerful, but they are also unevenly available. Digital channels tend to be rich in such signals, while offline or fallback flows may carry very little.
- Labels are usually derived from chargebacks, customer disputes, or post-transaction investigations. These labels arrive late, are incomplete, and often reflect operational thresholds rather than absolute truth. They are shaped as much by customer behavior and bank policy as by actual fraud activity.
- Timestamps capture when events occurred, when they were processed, and when they were labeled. In real systems, these are rarely aligned. Event time, processing time, and label time often differ, and that difference matters.
Before doing any exploratory analysis, I always start by understanding structure and scale, not statistics.
import pandas as pd

df = pd.read_csv("transactions.csv")

df.shape  # how many records, how many columns
df.head()  # a first look at actual values, not just the schema
df.info()  # dtypes, non-null counts, and memory footprint
These simple checks answer several critical questions immediately.
- How many records are we dealing with, and over what time span?
- Which columns are numeric, categorical, or temporal?
- Where are values missing, and how systematically?
- Are data types aligned with operational meaning, or merely with storage convenience?
In many projects, issues surface at this stage that later explain model behavior. Amount fields stored as strings due to currency symbols. Boolean flags encoded inconsistently across systems. Timestamps parsed without time zones. Identifiers reused across migrations.
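Two of those artifacts, amounts stored as strings and naive timestamps, are cheap to surface early. The sketch below is illustrative: the column names, the dollar-and-comma formatting, and the UTC assumption all stand in for whatever the real source system produces, and should be confirmed against it.

```python
import pandas as pd

# Illustrative records showing two common schema artifacts:
# amounts stored as strings with currency symbols, and naive timestamps.
df = pd.DataFrame({
    "transaction_amount": ["$120.50", "$9.99", "$1,040.00"],
    "event_time": ["2024-03-01 02:14:07", "2024-03-01 09:30:00",
                   "2024-03-02 23:59:59"],
})

# Strip symbols and thousands separators before casting to float.
df["transaction_amount"] = (
    df["transaction_amount"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Parse timestamps explicitly and attach the (assumed) source time zone
# instead of leaving them naive.
df["event_time"] = pd.to_datetime(df["event_time"]).dt.tz_localize("UTC")

print(df.dtypes)
```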
This step is not about fixing problems yet. It is about forming a mental model of how the data was produced.
Understanding the schema at this level helps distinguish between true signals and artifacts of system design. It also prevents incorrect assumptions from creeping into EDA later, where patterns can look statistically meaningful but operationally misleading.
In regulated environments, this discipline is not optional. Schema misunderstandings have a habit of resurfacing during audits, incident reviews, or unexplained production drift.
This is why experienced teams treat schema exploration as part of risk control, not just data preparation.
And it is also why, before plotting a single distribution, it is worth knowing exactly what kind of system produced the numbers you are about to analyze.
Initial Data Quality Checks
In banking systems, data quality checks are not a hygiene step. They are an early warning system.
Most production issues I have seen did not originate in complex modeling logic. They surfaced later as unexplained spikes in alerts, sudden drops in performance, or uncomfortable audit questions. In almost every case, the root cause was visible much earlier in the data.
The first objective here is not to clean the data. It is to understand how and why it might be unreliable.
Missing Values Are Rarely Accidental
The instinctive reaction to missing values is to count them and plan imputation. In regulated environments, that instinct needs to be slowed down.
df.isnull().mean().sort_values(ascending=False)
A missing value often tells a story.
A missing device ID may indicate a fallback authorization path. A missing merchant category could be the result of upstream enrichment failures. A missing customer attribute might reflect delayed batch updates rather than absence of information.
Treating all missing values as random noise removes information that could later help explain model behavior or decision outcomes.
Before deciding how to handle missing values, it is critical to understand which systems produce them and under what conditions.
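One way to slow the imputation instinct down is to segment missingness by an operational dimension such as channel. The sketch below uses toy data and assumed column names (channel, device_id); the point is the pattern: if device IDs are missing only on fallback flows, the gap is systemic, not random.

```python
import pandas as pd

# Toy data: device IDs absent on every fallback-routed transaction.
df = pd.DataFrame({
    "channel":   ["pos", "ecom", "ecom", "fallback", "pos", "fallback"],
    "device_id": ["d1",  None,   "d3",   None,       "d5",  None],
})

# Missingness rate per channel: a rate of 1.0 in one channel points to
# an operational cause, not random noise.
rates = (
    df.assign(device_missing=df["device_id"].isnull())
      .groupby("channel")["device_missing"]
      .mean()
)
print(rates)
```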
Duplicates and Repeated Records
In transactional systems, duplicates are rarely simple duplicates.
df.duplicated().sum()
A repeated transaction ID may represent retries due to network timeouts. Multiple records with the same attributes may correspond to reversals or partial approvals. Removing duplicates blindly can erase important signals about customer behavior or system stress.
Instead of asking “Should we drop duplicates?”, the better question is “What operational process created them?”
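Rather than dropping them, it often pays to pull the repeated groups out and inspect what differs between the copies. A minimal sketch, with an assumed transaction_id key and an invented response code standing in for a timeout:

```python
import pandas as pd

# Toy data: t2 appears twice, as it would after a timeout-driven retry.
# "91" as a timeout response code is an assumption for illustration.
df = pd.DataFrame({
    "transaction_id": ["t1", "t2", "t2", "t3"],
    "response_code":  ["00", "91", "00", "00"],
})

# Keep every copy of a repeated ID so the copies can be compared,
# rather than silently discarding all but the first.
dupes = df[df["transaction_id"].duplicated(keep=False)]
print(dupes.sort_values("transaction_id"))
```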
Range and Validity Checks
Basic range checks often reveal more than expected.
df["transaction_amount"].describe()
Negative amounts, extreme values, or impossible timestamps are not just data errors. They often reflect refunds, chargebacks, currency conversions, or system corrections. Understanding these cases early prevents misinterpretation later during EDA or modeling.
Categorical fields deserve the same scrutiny.
df["channel"].value_counts()
Unexpected category values often point to undocumented system changes or legacy mappings that no one remembers until something breaks.
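A simple guard is to compare the observed categories against the documented set and surface anything outside it. The documented values and the legacy code below are assumptions for illustration:

```python
import pandas as pd

# Toy channel values, including a legacy code left over from a migration.
channel = pd.Series(["pos", "ecom", "POS_OLD", "pos", "in_app", "POS_OLD"])

# Anything outside the documented set deserves an explanation
# before it reaches a model.
documented = {"pos", "ecom", "in_app"}
unexpected = set(channel.unique()) - documented
print(unexpected)
```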
Label Sanity Checks
Labels deserve special attention because they directly influence learning.
df["is_fraud"].value_counts(normalize=True)
In fraud datasets, extreme imbalance is expected. What matters more is consistency. Sudden changes in fraud rate over time may indicate policy changes, reporting delays, or operational backlogs rather than real shifts in behavior.
It is also important to remember that labels often reflect what was detected, not everything that happened. Undetected fraud does not appear as a negative label. This bias is not fixable, but it must be understood.
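A quick way to check that consistency is to track the fraud rate over calendar time. The toy data below fabricates a recent-month drop; in real datasets a decline like this usually means chargebacks have not arrived yet, not that fraud stopped:

```python
import pandas as pd

# Toy data with an apparent fraud-rate drop in the latest month.
df = pd.DataFrame({
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03",
         "2024-02-25", "2024-03-10", "2024-03-28"]
    ),
    "is_fraud": [1, 0, 1, 0, 0, 0],
})

# Fraud rate per calendar month.
monthly_rate = (
    df.groupby(df["event_time"].dt.to_period("M"))["is_fraud"].mean()
)
print(monthly_rate)
```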
Temporal Consistency
Time is a silent dependency in most banking data.
df["event_time"] = pd.to_datetime(df["event_time"])
df["label_time"] = pd.to_datetime(df["label_time"])
(df["label_time"] < df["event_time"]).sum()
Cases where label timestamps precede event timestamps are not uncommon in merged datasets. They usually indicate data integration issues, not time travel. Left unnoticed, they introduce leakage and invalidate evaluation later.
Understanding event time, processing time, and label time differences is essential before any modeling decision is made.
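Once both timestamps are parsed, the delay between event and label is worth profiling explicitly. The column names follow the snippet above; the dates are illustrative:

```python
import pandas as pd

# Toy data: one prompt label, one slow chargeback, one impossible ordering.
df = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "label_time": pd.to_datetime(["2024-02-15", "2024-01-02", "2023-12-30"]),
})

delay_days = (df["label_time"] - df["event_time"]).dt.days

# Long tails here tell you how much recent data is effectively unlabeled.
print(delay_days.describe())

# Negative delays flag integration problems, not time travel.
print((delay_days < 0).sum())
```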
Why This Step Matters More Than It Appears
Initial data quality checks are not about making the dataset perfect. They are about making it honest.
They help surface:
- Operational artifacts disguised as patterns
- Bias introduced by processes rather than behavior
- Hidden assumptions baked into labels
- Risks that only become visible under scale
In many real-world projects, the decisions made at this stage quietly shape everything that follows. Models trained on misunderstood data often look impressive offline and fail ungracefully in production.
This is why experienced teams treat data quality checks as part of model risk management, not just preprocessing.
Only once the data’s limitations are understood does exploratory analysis begin to add real value.
In the next section, we will move into exploratory data analysis itself and see how these early observations influence what we trust and what we question in the patterns that emerge.
Exploratory Data Analysis That Actually Matters
Exploratory data analysis is often presented as a visual exercise. Plot a few distributions, compute correlations, generate heatmaps, and move on. In real banking systems, that version of EDA is mostly decorative.
EDA that actually matters is investigative. It asks why patterns exist, not just whether they exist. It treats anomalies as clues rather than inconveniences. Most importantly, it is guided by how decisions are made in production, not by what looks interesting in a notebook.
The goal is not to understand the data in isolation. The goal is to understand how the data behaves when it is used to make real-time decisions under regulatory and operational constraints.
Univariate Analysis: Establishing Baselines
Univariate analysis is the first checkpoint. It answers a simple but critical question: does each feature behave the way the business expects it to behave?
df["transaction_amount"].describe()
In fraud datasets, transaction amounts are rarely symmetric. They are heavy-tailed, skewed, and often segmented by channel. A mean value is almost meaningless here. Medians and percentiles tell a more honest story.
Plotting distributions is useful, but interpretation matters more than visualization.
df["transaction_amount"].hist(bins=50)
A long tail may indicate high-value fraud attempts, legitimate high-net-worth customers, or operational artifacts such as corporate transactions flowing through consumer pipelines. Each interpretation leads to a different modeling decision later.
The same applies to categorical variables.
df["channel"].value_counts(normalize=True)
Unexpected dominance of a channel often reflects routing logic rather than customer preference. Treating it as behavior rather than system design leads to misleading conclusions.
Target Variable Distribution: Accepting Imbalance as Reality
In most fraud systems, the target variable is severely imbalanced.
df["is_fraud"].value_counts(normalize=True)
This is not a problem to fix. It is a constraint to design around.
EDA at this stage helps internalize that accuracy will be misleading, that recall and precision trade-offs will matter, and that threshold selection will later become a business decision rather than a purely technical one.
Ignoring this reality early almost guarantees disappointment later.
Bivariate Analysis: Where Signal Starts to Appear
Bivariate analysis begins to reveal whether features behave differently for fraudulent and legitimate transactions.
df.groupby("is_fraud")["transaction_amount"].median()
Differences here often validate intuition but sometimes challenge it. Fraud may cluster at low amounts to avoid detection. Or it may spike at specific thresholds tied to authorization limits.
Time-based features are particularly revealing.
df["transaction_hour"] = pd.to_datetime(df["event_time"]).dt.hour
df.groupby("transaction_hour")["is_fraud"].mean()
Fraud patterns frequently align with human and system rhythms. Late-night spikes, weekend behavior, or anomalies around salary cycles are common. These patterns inform feature engineering later, but only if they are understood first.
Multivariate Analysis: Context Over Correlation
In complex systems, single features rarely explain behavior on their own. Context matters.
Rather than chasing correlation matrices, experienced teams explore combinations.
For example, transaction amount behaves very differently when segmented by channel or geography. A high amount online may be suspicious, while the same amount in a physical store may be routine.
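That kind of segmentation is a one-line groupby. The channels, labels, and amounts below are fabricated to make the contrast visible:

```python
import pandas as pd

# Toy data: amounts that are routine in-store but extreme online.
df = pd.DataFrame({
    "channel":            ["pos", "pos", "ecom", "ecom", "ecom", "pos"],
    "is_fraud":           [0, 0, 0, 1, 1, 0],
    "transaction_amount": [120.0, 85.0, 30.0, 900.0, 750.0, 140.0],
})

# Median amount per channel and label: the signal only appears once the
# amount is read in the context of the channel.
segmented = df.groupby(["channel", "is_fraud"])["transaction_amount"].median()
print(segmented)
```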
EDA at this level is less about statistics and more about asking structured questions:
- Does this feature behave differently under different conditions?
- Are patterns consistent across time?
- Do they align with known operational constraints?
These questions prevent over-generalization during modeling.
Missing Values and Outliers: Signals in Disguise
Missing values and outliers are often where the most useful information hides.
Instead of removing them, EDA asks why they exist.
A missing device ID may correlate strongly with fraud simply because certain channels lack device instrumentation. An extreme transaction amount may represent corporate spending misrouted into a consumer pipeline.
Flagging these conditions often works better than correcting them.
df["high_amount_flag"] = df["transaction_amount"] > df["transaction_amount"].quantile(0.99)
df["missing_device_flag"] = df["device_id"].isnull()
These flags preserve information and remain explainable during audits.
What Good EDA Prevents
When done properly, EDA prevents a class of failures that rarely get blamed on data but almost always originate there.
- It prevents models from learning system artifacts instead of behavior.
- It prevents leakage disguised as strong performance.
- It prevents features from being trusted blindly.
- It prevents decision logic from being built on unstable signals.
In many mature banking projects, the most impactful modeling decisions are already visible at this stage. EDA reveals which features deserve trust, which require caution, and which should not be used at all.
This is why EDA is not an exploratory step. It is a design phase.
And it is also why teams that invest here tend to spend less time explaining failures later.
Key Insights and Observations
By the end of exploratory data analysis, we should not merely have a collection of charts and summary tables. We should have a clearer understanding of what the data can be trusted to tell us and, just as importantly, where it is likely to mislead us.
At this stage, several questions should already have credible answers.
- Which features consistently behave differently for fraudulent and legitimate transactions, and which only appear predictive because of operational side effects? Some variables show strong separation early on, while others look promising until segmented by time, channel, or geography. EDA helps distinguish genuine behavioral signals from artifacts created by routing logic, enrichment gaps, or policy thresholds.
- Where does data quality introduce silent risk? Missing values, inconsistent categories, and delayed labels often align with specific operational scenarios rather than randomness. These patterns matter because models trained on such data tend to overestimate their confidence and underperform when conditions change.
- How do class imbalance and label latency shape everything that follows? Extreme imbalance means that traditional performance metrics will be misleading. Delayed labels mean that recent data cannot be evaluated honestly. Both constraints are visible during EDA, long before evaluation frameworks are chosen, and they should influence how success is defined later.
- Which assumptions no longer hold once the data is examined closely? EDA often exposes beliefs that seemed reasonable on paper but collapse under scrutiny. Assumptions about customer behavior, transaction timing, or data completeness frequently turn out to be overly simplistic. Identifying these early prevents them from hardening into model logic.
In many projects I have been involved with, the eventual modeling failures were not mysterious. They could be traced back to insights that were already visible at this stage but quietly ignored under delivery pressure or optimism about what modeling would fix.
EDA rarely tells us how to build the perfect model. What it does do is narrow the space of safe decisions. It clarifies what is stable, what is fragile, and where caution is required.
This is why experienced teams treat the end of EDA as a decision point rather than a milestone. The question is not whether we are ready to train a model. The question is whether we understand the data well enough to justify doing so.
And in many real-world systems, the quality of that understanding determines the quality of everything that follows.
Why Most Modeling Mistakes Are Already Visible Here
By the time a project reaches the modeling stage, many of its outcomes are already determined.
This is uncomfortable to admit because it challenges a common narrative in machine learning that better algorithms or more tuning will rescue weak results. In regulated banking environments, that belief rarely holds. Models do not fail in isolation. They fail because they are built on misunderstood data, incomplete context, or unsafe assumptions that were visible much earlier.
Exploratory data analysis is where those assumptions surface.
When features appear highly predictive during EDA but only within narrow operational windows, that fragility does not disappear during training. When labels are delayed, biased, or shaped by historical policy decisions, models learn those biases faithfully. When missing values correlate strongly with outcomes because of system behavior rather than customer intent, models internalize system artifacts as risk signals.
None of this is subtle. It is simply easy to overlook when the pressure to deliver a model outweighs the discipline to question the data.
Many of the most common modeling failures have clear precursors at this stage. Models that overfit to historical behavior often rely on features whose distributions are unstable over time. Models that perform well offline but degrade quickly in production are frequently trained on data that leaks future information through timestamp misalignment. Models that struggle under regulatory review often depend on variables that were never fully understood from an operational standpoint.
EDA exposes these risks early, when they are still cheap to address.
It forces uncomfortable questions. Can this feature be trusted when the system changes? Will this pattern survive a policy update or market shift? Is this signal capturing behavior or merely reflecting how we labeled the past?
Teams that take these questions seriously tend to build models that generalize, degrade gracefully, and withstand scrutiny. Teams that rush past them often end up compensating later with complex overrides, post-hoc explanations, and reactive governance.
This is why EDA is not a preliminary step to modeling. It is the stage where modeling success is either enabled or quietly undermined.
In the next part of this series, we will move from understanding data to shaping it. Feature engineering and modeling will make these early decisions concrete, for better or worse.
And by then, the influence of what we saw here will be unmistakable.
Closing Thoughts
Strong machine learning systems are rarely defined by the sophistication of their models. They are defined by how well the data behind those models is understood.
In regulated environments like banking, that understanding is not optional. It determines whether a system earns trust, survives audits, and behaves predictably under pressure. Exploratory data analysis is where that trust is first established. It is where assumptions are tested, risks are surfaced, and the boundaries of what a model can safely do become clear.
Teams that treat EDA as a formality often discover its importance only after deployment, when corrections are costly and explanations are required. Teams that invest here tend to build systems that are not only more accurate, but more resilient, interpretable, and defensible.
This article focused on the first part of a larger journey, from data to decisions. In the next part, we will move into feature engineering and modeling, and examine how early data insights shape every technical choice that follows.
If these reflections align with your experience, I would be interested in hearing your perspective. Practical insights from the field often add more value than theory alone.
If you found this useful, consider liking or sharing it with others who work on real-world ML systems. And if you would like to follow along as this series continues into modeling, decision logic, and production realities, feel free to follow my work here.
Real systems are built through shared learning. Conversations are part of that process.
From Raw Data to Insights: A Practical Guide to EDA (Part 1) was originally published in Towards AI on Medium.