Why a 95% Accurate Model Can Still Be Wrong (Part 3)

Part 3 of a 4-part series: From Data to Decisions

At some point in most machine learning projects, a familiar moment arrives.

The model looks strong. Accuracy is high. ROC curves are smooth. Validation metrics are better than the baseline. Stakeholders are cautiously optimistic.

And yet, when the system goes live, something feels wrong.

Too many legitimate customers are blocked. Analysts complain about alert quality. Business teams lose trust. Regulators ask uncomfortable questions. The model has not failed technically, but the system is clearly not working.

This is not a modeling problem. It is a decision problem.

In Part 1 and Part 2 of this series, we focused on understanding data and shaping it into meaningful features. Those steps constrained what the model could learn. In this part, we confront a harder truth: even a well-trained model can produce poor outcomes if it is evaluated and used incorrectly.

Evaluation is not about proving that a model is good. It is about determining whether its decisions will hold up under real-world cost, risk, and scrutiny.

This is where most AI systems quietly break.

Evaluation Metrics and Why They Matter

Metrics are often treated as objective measures of success. In practice, they encode priorities.

Accuracy, AUC, precision, recall. None of these are wrong. But none of them are neutral either. Each one answers a different question, and choosing the wrong one can optimize the system toward outcomes the business never intended.

In fraud detection, for example, overall accuracy is almost meaningless. When fraud rates are below one percent, a model that predicts “not fraud” for every transaction will appear highly accurate and completely useless.

from sklearn.metrics import accuracy_score

# y_true: ground-truth labels, y_pred: model predictions (assumed defined)
accuracy_score(y_true, y_pred)
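
To see the trap concretely, here is a small sketch on synthetic data (the 1% fraud rate and counts are illustrative): a "model" that never predicts fraud still reports 99% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic imbalanced data: 990 legitimate (0), 10 fraudulent (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that predicts "not fraud" for every transaction
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99 -- yet it catches zero fraud
```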

What matters instead is how errors are distributed.

False negatives allow fraud to pass through and create direct financial loss. False positives block legitimate customers and create indirect costs that are harder to quantify but equally real. Evaluation metrics must reflect this asymmetry.

Precision and recall begin to capture this tension.

from sklearn.metrics import precision_score, recall_score

# Of all flagged transactions, how many were truly fraud?
precision_score(y_true, y_pred)

# Of all true fraud, how much did the model catch?
recall_score(y_true, y_pred)

But even these metrics are incomplete on their own. A model with high recall but poor precision may flood analysts with alerts. A model with high precision but low recall may miss emerging fraud patterns.
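
The alert-flooding scenario is easy to simulate. In this sketch (synthetic data, hypothetical model behavior), a model that flags 100 transactions to catch all 10 frauds achieves perfect recall but only 10% precision: 90 of its 100 alerts waste analyst time.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic set: 990 legitimate (0), then 10 fraudulent (1)
y_true = np.zeros(1000, dtype=int)
y_true[990:] = 1

# Aggressive model: flags the last 100 transactions, catching every fraud
y_pred = np.zeros(1000, dtype=int)
y_pred[900:] = 1

print(recall_score(y_true, y_pred))     # 1.0 -- all fraud caught
print(precision_score(y_true, y_pred))  # 0.1 -- 90 of 100 alerts are false
```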

The key insight here is simple but often ignored: metrics do not describe model quality in isolation. They describe trade-offs.

Those trade-offs must align with business reality.

Precision–Recall vs ROC in Practice

ROC curves are popular because they look clean and mathematically elegant. They summarize model discrimination across thresholds. They are also frequently misunderstood.

In highly imbalanced systems, ROC curves can remain strong even when precision collapses at operating thresholds. This is why models that look excellent in ROC space can perform poorly in production.
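
The gap is easy to reproduce on synthetic data. In the sketch below (the 1% positive rate and score distributions are assumptions for illustration), ROC AUC looks strong while average precision, which summarizes the Precision–Recall view, is far lower.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic scores at a ~1% positive rate (all numbers illustrative)
rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.01).astype(int)
y_scores = rng.normal(loc=y_true * 1.5, scale=1.0)  # modest class separation

print(roc_auc_score(y_true, y_scores))            # ROC AUC looks strong
print(average_precision_score(y_true, y_scores))  # PR-style summary is far lower
```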

Precision–Recall curves tell a more honest story in these environments.

from sklearn.metrics import precision_recall_curve

# y_scores: model scores or probabilities for the positive class
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

Unlike ROC, Precision–Recall focuses explicitly on the positive class. It makes visible the trade-off between catching more fraud and maintaining alert quality.

In banking systems, this distinction matters. Analysts do not review ROC curves. They review alerts. Customers do not experience AUC. They experience declines.

This is why experienced teams evaluate models at specific operating regions, not across abstract curves. The question is not whether the model separates classes well in theory. The question is whether it performs acceptably at the thresholds where decisions will actually be made.
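
One practical way to do this, sketched here on toy data, is to read an operating point directly off the Precision–Recall curve: for instance, the lowest threshold that still meets a target precision. The 0.8 target and the scores below are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores: frauds tend to score higher, with some overlap
y_true   = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_scores = np.array([0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# First (lowest) threshold whose precision meets the 0.8 target;
# precision[-1] is a sentinel with no matching threshold, so exclude it
idx = int(np.argmax(precision[:-1] >= 0.8))
print(f"operate at score >= {thresholds[idx]}: "
      f"precision={precision[idx]:.2f}, recall={recall[idx]:.2f}")
```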

A model that dominates another globally may still be inferior in the narrow slice of behavior that matters operationally.

Understanding this early prevents teams from celebrating improvements that never translate into better decisions.

Threshold Selection and Cost Sensitivity: Where Models Become Decisions

Every classification model eventually faces a question it cannot answer on its own.

At what point does a score become a decision?

Models produce probabilities, scores, or ranks. Businesses operate on actions. Decline or approve. Flag or pass. Escalate or ignore. The threshold that connects these two worlds is not a technical constant. It is a business choice with financial, operational, and reputational consequences.

This is where many teams make their most consequential mistake. They treat threshold selection as a tuning exercise rather than a decision design problem.

In fraud systems, lowering a threshold increases recall and catches more fraud. It also increases false positives, blocks more customers, and floods analysts. Raising the threshold improves customer experience but allows more fraud through. There is no correct threshold in isolation. There is only a threshold that aligns with current risk appetite, operational capacity, and regulatory posture.

Cost sensitivity is the missing layer here.

  • A false negative has a measurable cost: chargebacks, reimbursements, investigation time.
  • A false positive has a different kind of cost: customer churn, call center volume, brand damage, and regulatory scrutiny if declines appear discriminatory or opaque.
  • These costs are rarely symmetric, and they are rarely static.
  • Experienced teams make this explicit. They assign approximate costs to different error types and evaluate thresholds against expected impact, not abstract metrics.

Conceptually, the logic is simple.

  • What does it cost us to miss fraud at this score range?
  • What does it cost us to block a legitimate transaction?
  • How many decisions will this threshold generate at scale?
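
These questions can be turned into a small expected-cost calculation. The sketch below is illustrative only: the cost figures, helper names, and toy data are assumptions, not a standard API.

```python
import numpy as np

# Hypothetical, illustrative costs per error type (currency units)
COST_FN = 500.0   # missed fraud: chargebacks, reimbursement, investigation
COST_FP = 25.0    # blocked legitimate customer: churn, call-center load

def expected_cost(y_true, y_scores, threshold):
    """Total cost of acting at this threshold on a validation set."""
    y_pred = (y_scores >= threshold).astype(int)
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return fn * COST_FN + fp * COST_FP

def pick_threshold(y_true, y_scores, candidates):
    """Choose the candidate threshold with the lowest expected cost."""
    costs = [expected_cost(y_true, y_scores, t) for t in candidates]
    return candidates[int(np.argmin(costs))]

# Toy validation data: three legitimate, two fraudulent transactions
y_true   = np.array([0, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.2, 0.6, 0.7, 0.9])
print(pick_threshold(y_true, y_scores, [0.5, 0.65, 0.95]))  # 0.65
```

Because the costs live in configuration rather than in the model, they can be revisited as risk appetite shifts, without retraining anything.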

Once framed this way, threshold selection becomes a business conversation grounded in data, not a post-processing afterthought.

Importantly, thresholds are not permanent. They change with fraud patterns, customer behavior, seasonal effects, and operational capacity. Treating them as fixed values is a recipe for drift-related failures.

This is why mature systems separate model training from decision thresholds. The model learns patterns. The business controls how aggressively those patterns are acted upon.

Error Analysis: Understanding Failure Before It Scales

Metrics summarize behavior. Error analysis explains it.

Once a threshold is chosen, the next question is not “How good is the model?” but “Where does it fail, and why?”

False positives and false negatives are not symmetric mistakes. They cluster in different regions of the data and often tell very different stories.

False positives frequently occur at the edges of normal behavior. Customers traveling, merchants processing delayed settlements, new devices, unusual but legitimate spending patterns. If these cases dominate alerts, analysts lose trust quickly.

False negatives, on the other hand, often reflect blind spots. New fraud tactics, low-amount probing, slow-burn behavior that looks benign in isolation. These cases reveal where features and assumptions lag reality.

Error analysis should therefore be segmented, not aggregated.

Questions worth answering include:

  • Are false positives concentrated in specific customer segments?
  • Do false negatives spike at certain times or channels?
  • Are errors correlated with missing data or fallback flows?
  • Do errors increase after policy or system changes?
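
In pandas terms, segmented error analysis can be as simple as grouping a decision log by segment before computing error rates. The columns and segment names below are hypothetical.

```python
import pandas as pd

# Hypothetical decision log: one row per scored transaction
df = pd.DataFrame({
    "segment": ["retail", "retail", "travel", "travel", "travel"],
    "y_true":  [0, 1, 0, 0, 1],
    "y_pred":  [0, 0, 1, 1, 1],
})

df["false_positive"] = (df["y_true"] == 0) & (df["y_pred"] == 1)
df["false_negative"] = (df["y_true"] == 1) & (df["y_pred"] == 0)

# Error rates per segment rather than one aggregate number
by_segment = df.groupby("segment")[["false_positive", "false_negative"]].mean()
print(by_segment)
```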

These patterns are rarely random. They usually map back to decisions made earlier in data selection, feature engineering, or threshold design.

In production environments, error analysis also serves a governance function. It provides evidence that the system is understood, monitored, and actively managed. This matters when models are questioned by auditors or challenged by business stakeholders.

Teams that skip this step often discover issues only after customers complain or losses spike. Teams that invest here see problems while they are still contained.

Explainability: From Model Behavior to Business Trust

Explainability is often framed as a regulatory requirement. In practice, it is a trust requirement.

Regulators want to understand decisions. Business leaders want to defend them. Analysts want to act on them. Customers want explanations that make sense. A model that cannot support these needs becomes a liability, regardless of its accuracy.

Global explainability answers one question: What does this model generally rely on?

Feature importance rankings help identify dominant signals, unexpected dependencies, and potential bias. They are especially useful for sanity checks. If a model relies heavily on a feature that is unstable, delayed, or poorly understood, that is a warning sign.
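
As a sanity-check sketch (synthetic data, hypothetical feature names), a tree ensemble's built-in importances can surface which signal dominates. Here the label is constructed to depend almost entirely on one feature, and the ranking should reflect that.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the label is driven almost entirely by the third feature
feature_names = ["amount", "merchant_age_days", "txn_velocity_1h"]
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = (X[:, 2] > 0.8).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Global view: which signals does the model generally rely on?
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda kv: -kv[1]):
    print(f"{name}: {imp:.2f}")
```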

Local explainability answers a different question: Why was this specific decision made?

This matters more operationally. When a transaction is declined or an alert is raised, someone will ask why. The explanation must be coherent, consistent, and aligned with business logic.

Crucially, explainability should reflect the decision, not just the score. Explaining why a score is high is not the same as explaining why action was taken. The threshold, context, and rules around the score are part of the explanation.
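
One lightweight pattern (the function name and fields here are hypothetical) is to record the decision context alongside the score, so the explanation covers the action taken, not just the model output:

```python
# Hypothetical decision record: the explanation carries the threshold and
# any triggered rules, not just the model score
def explain_decision(score, threshold, triggered_rules):
    action = "decline" if (score >= threshold or triggered_rules) else "approve"
    return {
        "action": action,
        "model_score": score,
        "threshold": threshold,            # business-owned, outside the model
        "triggered_rules": triggered_rules,
    }

print(explain_decision(0.72, 0.60, ["new_device"]))
```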

This is where many systems fall short. They explain models but not decisions.

Mapping Scores to Decisions: The Final Design Layer

The most important realization in production AI systems is this: models do not make decisions, systems do.

Between a model score and an action lies a policy layer. This layer may include thresholds, rules, overrides, human review, and escalation paths. It exists whether teams design it explicitly or not.

In well-designed systems, this layer is intentional.

A low score may always pass through. A high score may trigger immediate action. Scores in between may be routed for additional checks or human review. Different thresholds may apply to different segments or channels. Certain features may trigger overrides regardless of score.
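
A minimal sketch of such a policy layer might look like the following. The thresholds, action names, and the watchlist override are all hypothetical assumptions, not part of any model.

```python
# Hypothetical policy layer: score bands map to actions, and hard rules
# can override the score entirely
REVIEW_LOW, DECLINE_HIGH = 0.30, 0.90

def decide(score: float, on_watchlist: bool = False) -> str:
    if on_watchlist:               # hard rule: overrides the score
        return "decline"
    if score >= DECLINE_HIGH:      # high confidence: act automatically
        return "decline"
    if score >= REVIEW_LOW:        # uncertain band: route to analysts
        return "manual_review"
    return "approve"               # low score: pass through

print(decide(0.95), decide(0.55), decide(0.10))
```

Because the policy lives outside the model, the business can tighten or relax it without retraining, which is exactly the separation described above.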

This structure allows systems to balance automation with control. It also allows gradual changes without retraining models for every policy shift.

Most importantly, it creates accountability. Decisions can be explained as the outcome of a model operating within a clearly defined framework, rather than as opaque algorithmic outputs.

This is the difference between deploying a model and operating a decision system.

Reader Takeaway: Why Decision Design Matters More Than Model Scores

By the time a model produces a score, most of the hard work should already be done.

Success in production is not defined by impressive validation charts. It is defined by stable behavior, predictable trade-offs, and decisions that withstand scrutiny.

This is why decision design matters more than model scores.

In the final part of this series, we will focus on what most teams underestimate until it is too late: monitoring, drift, governance, and operating AI systems in the real world.

If this perspective aligns with your experience, I’d value your thoughts in the comments. If you’re following the series, Part 4 will complete the journey from data to decisions.

And if this work is useful, you’re welcome to follow me here on Medium for the final part and future writing grounded in real production systems.


Why a 95% Accurate Model Can Still Be Wrong (Part 3) was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
