Yes, Your AI Product Hallucinates. No, the Model Is Not the Only Reason.

This is not a guide on catching hallucinations. This is about why they happen in the first place, and what your architecture has to do with it.

A legal research tool cites a case that does not exist. A customer support bot confidently quotes a refund policy the company never had. A financial report summariser invents a quarterly figure that looks perfectly plausible.

These are not edge cases from research papers. These are production systems, built by competent teams, running on capable models.

And in every instance, the first instinct was the same: “We need a better model” or “We need better prompts.”

That instinct is not wrong. It is just incomplete.

I have spent years building GenAI products. The single most important lesson I have learned about hallucination is this: it is never just one thing. It is the result of decisions made across three distinct layers, from model selection to system architecture to product design. Most teams obsess over one layer and neglect the other two.

This post is not about evals. It is not about detection or mitigation frameworks. There are excellent guides on those. This is about the problem itself: where it comes from, and why understanding the full stack of decisions that shape hallucination is the highest leverage insight you can have.

What Does Hallucination Actually Look Like in Production?

“Hallucination” is an umbrella term that hides at least five distinct failure modes. Each one has different causes and demands a different response.

Factual fabrication is the one everyone pictures. The model invents facts, names, dates, or citations. It did not misread the source. It never saw one. It generated a plausible answer because that is what next token prediction does when grounding is absent.

Confident wrongness is subtler and arguably more dangerous. The model gives a definitive answer to a question it should have hedged on. There is no uncertainty signal. The user has no way to tell the difference between a correct response and a fabricated one.

Source attribution errors plague RAG-based products. The model generates a correct-sounding statement and ties it to a source that either does not say that thing or does not exist. A Stanford study of legal RAG tools found hallucination rates between 17% and 33%, even with retrieval augmentation. The citations looked right. The substance was wrong.

Logical inconsistency shows up in long form generation. Paragraph one says revenue was up. Paragraph three says it declined. Both sound authoritative. Neither flags the contradiction.

Context window amnesia is specific to longer interactions. Research on the “lost in the middle” phenomenon shows that LLMs deprioritise information buried in the middle of large contexts. The model effectively “forgets” what it was given.

Which of these failure modes is your product most exposed to? That question should shape every decision that follows.

Three Tiers of Hallucination Control

Here is the mental model I wish someone had given me when I started building these systems.

Hallucination is not controlled at one layer. It is shaped by decisions across three tiers, each with a different role:

Tier 1: Model-level decisions set the floor. These determine your baseline hallucination rate before any architecture or product design comes into play.

Tier 2: Architectural decisions set the ceiling. These are the highest leverage interventions because they compound across every single interaction.

Tier 3: Product and organizational decisions determine real-world impact. These decide whether the hallucinations that do slip through cause harm or get caught.

Most teams pour all their energy into Tier 1: switch models, tweak temperature, hope for the best. Some invest in Tier 2. Very few think about Tier 3 at all.

The teams that ship AI products with consistently low hallucination rates invest across all three.

Tier 1: Model-Level Decisions (Setting the Floor)

Let me be clear: model selection matters. Different models hallucinate at significantly different rates on the same task, with the same prompt, and the same architecture.

A smaller, cheaper model will fabricate more frequently on knowledge-intensive tasks than a frontier model. A model fine-tuned on domain-specific data hallucinates less in that domain. Reasoning-oriented models behave differently than standard completion models. Temperature settings alone can shift hallucination rates meaningfully.

These are real levers. Picking the right model is the first and most foundational decision you make.

But here is the catch. Model selection has a ceiling. You can pick the best model available and still hallucinate badly if your architecture is poor. You cannot buy your way out of a system design problem by upgrading the model. I have seen teams switch from one frontier model to another and see marginal improvement because the real issue was upstream: bad retrieval, monolithic prompts, no validation checkpoints.

Think of it this way:

  • Model choice sets the floor. It determines the best your system can possibly do under ideal conditions.
  • Architecture determines how close you actually get to that floor. Most systems operate far above their model’s floor because the architecture introduces noise, bad context, and compounding errors.

So yes, choose your model carefully. Match it to your task. Tune your parameters. But do not stop there. The rest of this post is about the decisions that have an even bigger impact on what your users actually experience.

Tier 2: Architectural Decisions (Raising the Ceiling)

This is where most teams underinvest and where the largest gains live.

Prompt and Constraint Design

Most teams focus on making prompts more detailed. What actually moves the needle is making them more constrained.

Have you ever seen a system prompt that says “Answer the user’s question helpfully and accurately”? That is practically an invitation to hallucinate. Maximum latitude. No guardrails for uncertainty.

Now compare it to:

“Answer using only the provided context. If the context does not contain enough information to answer confidently, say so explicitly. Do not infer beyond what the context supports.”

Same model. Same question. Dramatically different hallucination rate.

The principle: every degree of freedom you give the model is a degree of freedom where hallucination can enter.

  • Constrain the output format (structured JSON vs free text)
  • Constrain the knowledge source (only retrieved context vs parametric knowledge)
  • Constrain the uncertainty behaviour (explicit “I do not know” vs best-effort guess)
  • Constrain the scope (answer only about X, refuse questions about Y)

The tighter the box, the less room for fabrication.
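The four constraints above can be sketched in code. This is a minimal illustration, assuming a hypothetical `call_llm(prompt) -> str` function standing in for your provider's client; the JSON schema and the refusal behaviour are illustrative choices, not a prescribed API:

```python
import json

# A constrained system prompt: each sentence removes a degree of freedom.
SYSTEM_PROMPT = (
    "Answer using only the provided context. "
    'Respond with JSON: {"answer": string, "confident": boolean}. '
    "If the context does not contain enough information, return "
    '{"answer": "I do not know", "confident": false}. '
    "Do not infer beyond what the context supports."
)

REFUSAL = {"answer": "I do not know", "confident": False}

def constrained_answer(question, context, call_llm):
    """Ask under tight constraints and validate the output shape.

    `call_llm` is a hypothetical prompt -> str function; swap in your
    provider's client.
    """
    prompt = f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
    raw = call_llm(prompt)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed output is treated as a refusal, not passed through.
        return REFUSAL
    if (
        not isinstance(parsed, dict)
        or not isinstance(parsed.get("answer"), str)
        or not isinstance(parsed.get("confident"), bool)
    ):
        # Schema violations are rejected rather than shown to users.
        return REFUSAL
    return parsed
```

The validation step is the point: an output that escapes the box gets converted into an explicit refusal instead of reaching the user as free text.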

Retrieval Pipeline Quality

Most teams adopt RAG and assume the hallucination problem is solved.

It is not. It is relocated.

RAG shifts the failure mode from “the model made something up from training data” to “the model made something up from bad retrieval results.” And bad retrieval is far more common than teams realise.

Think about what can go wrong:

  • Your chunking strategy splits a critical paragraph across two chunks. The retriever fetches only one half.
  • Your embedding model misses the semantic nuance. It returns topically related but factually irrelevant documents.
  • Your knowledge base is six months stale but nobody has refreshed it.
  • Your reranker is not tuned. The most relevant document lands at position 4 while a marginal one takes position 1.
  • Your metadata is incomplete. The retriever cannot filter by date, document type, or source authority.

In every scenario, the model receives context that looks relevant but misleads. And it does exactly what it was designed to do: generate a fluent, confident answer grounded in whatever it received.

Garbage retrieval in, confident hallucination out.

If you are building a RAG product, the retrieval pipeline deserves as much architectural attention as the generation layer. Possibly more.
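One low-tech place to start is gating what retrieval is allowed to pass to the generator. The sketch below assumes retrieved chunks arrive as dicts with hypothetical `score` and `updated` fields; the thresholds are placeholders, not recommendations:

```python
from datetime import datetime, timedelta

def filter_retrieved(chunks, min_score=0.75, max_age_days=180, now=None):
    """Drop weak or stale retrieval hits before they reach the generator.

    `chunks` is a list of dicts with hypothetical keys:
    {"text": str, "score": float, "updated": datetime}.
    """
    now = now or datetime.utcnow()
    kept = []
    for c in chunks:
        if c["score"] < min_score:
            # Topically related but weakly relevant: a hallucination risk.
            continue
        if now - c["updated"] > timedelta(days=max_age_days):
            # Stale knowledge base entry; better to miss than mislead.
            continue
        kept.append(c)
    # An empty result is a signal to refuse, not to let the model guess.
    return kept
```

The design choice worth noting: an empty list after filtering should trigger the "I do not know" path, not a generation attempt over whatever was left.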

Context Window Architecture

Here is something that surprised me early on. You can give the model the right information and it will still hallucinate, because of where that information sits in the context.

LLMs disproportionately attend to information at the beginning and end of the context window. Information buried in the middle gets deprioritised. If your critical evidence is at token position 50,000 out of 100,000, the model may effectively ignore it.

This means context management is not just an efficiency concern. It is a hallucination mitigation strategy.

  • How you order retrieved documents matters
  • How much context you include matters (more is not always better)
  • Whether you summarise or truncate matters
  • Where critical information is positioned matters

The question is not “does the model have access to the right information?” It is “is the right information positioned where the model will actually attend to it?”
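One common mitigation is to reorder retrieved documents so the strongest evidence sits at the edges of the context rather than the middle. A minimal sketch, assuming the documents arrive sorted best-first from your reranker:

```python
def order_for_attention(docs_by_relevance):
    """Place the most relevant documents at the start and end of the
    context, pushing the weakest toward the middle, where attention
    is lowest.

    `docs_by_relevance` is sorted best-first.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: best doc to the front, second-best to the back,
        # working inward so the middle holds the weakest documents.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

With four documents ranked d1 > d2 > d3 > d4, the prompt order becomes d1, d3, d4, d2: the top two sit at the positions the model attends to most.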

Workflow Decomposition

This is the factor I want to spend the most time on. It is the most under-appreciated and the most powerful.

Here is the pattern I see in most teams building their first GenAI product:

  1. They have a complex task
  2. They write one big prompt that describes the entire task
  3. They send it to the model
  4. The output is inconsistent: sometimes brilliant, sometimes hallucinated
  5. They make the prompt longer and more detailed
  6. Hallucination does not improve meaningfully
  7. They conclude the model is not good enough

Sound familiar?

The problem is not the model or the prompt. The problem is asking one LLM call to do too many cognitive tasks simultaneously.

When a single call must retrieve context, reason over it, identify patterns, synthesise findings, and generate polished output, you create an enormous surface area where hallucination can enter at any stage. And you have zero visibility into which stage failed.

The alternative: Try decomposing!

Instead of one prompt that says “analyse this data and write a summary,” split it:

  • An analyst step that examines the data and produces structured observations, hypotheses, and supporting evidence
  • A synthesis step that takes those observations and generates the final summary

The analyst does not worry about polished writing. The synthesiser does not worry about data interpretation. Each step has a narrower scope, clearer constraints, and a much smaller hallucination surface area.

This is not a hypothetical. This pattern, separating analysis from synthesis, consistently produces more grounded outputs than a monolithic prompt.
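A minimal sketch of the analyst/synthesis split, assuming a hypothetical `call_llm(prompt) -> str` function; the prompts are illustrative, not tuned templates:

```python
def analyse_then_synthesise(data, call_llm):
    """Two-step workflow: a narrow analyst step, then a narrow
    synthesis step. `call_llm` is a hypothetical prompt -> str function.
    """
    # Step 1: analyst. Structured observations only; no prose polish.
    observations = call_llm(
        "List observations from this data as bullet points, each with "
        "the evidence that supports it. Do not draw conclusions beyond "
        "the data.\n\n" + data
    )
    # Checkpoint: the intermediate output is inspectable here, and a
    # validation step could reject it before synthesis runs.
    # Step 2: synthesiser. Works only from the observations, never the
    # raw data, so its hallucination surface is the observation list.
    summary = call_llm(
        "Write a short summary using only these observations. Do not "
        "add claims that are not listed.\n\n" + observations
    )
    return {"observations": observations, "summary": summary}
```

The structural point is the seam between the two calls: you can log, inspect, or validate `observations` before a single word of the final summary is generated.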

And there are many variations of the same principle:

  • A verification step between generation and output that checks claims against sources
  • An evaluator agent that scores output for groundedness before it reaches the user
  • A planning step that decomposes the task, followed by specialised execution steps, each with only the context it needs

The core principle: every time you split a complex task into specialised steps with clear boundaries, you reduce hallucination surface area. Every time you add a checkpoint between steps, you create an opportunity to catch fabrication before it compounds.
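As an illustration of where such a checkpoint sits, here is a deliberately naive groundedness gate based on word overlap. A production system would use an NLI model or an LLM judge instead; the point is the placement between generation and output, not the check itself:

```python
def grounded_enough(claims, sources, threshold=0.6):
    """Naive groundedness checkpoint between generation and output.

    A claim counts as supported if enough of its words appear in some
    source. The 0.6 threshold is an arbitrary placeholder.
    """
    def supported(claim):
        words = {w.lower().strip(".,") for w in claim.split()}
        for src in sources:
            src_words = {w.lower().strip(".,") for w in src.split()}
            if words and len(words & src_words) / len(words) >= threshold:
                return True
        return False

    unsupported = [c for c in claims if not supported(c)]
    # Block the pipeline instead of letting fabrication compound downstream.
    return len(unsupported) == 0, unsupported
```

When the gate fails, the unsupported claims tell you exactly which part of the output to regenerate or route to a human, rather than discarding the whole response.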

Is there a cost? Yes. More LLM calls mean more latency and more token spend.

But here is the tradeoff most teams miss: a monolithic prompt that hallucinates 20% of the time is not cheaper than a multi-step workflow that hallucinates 3% of the time. The cost of a hallucination reaching a user, in trust erosion, support tickets, and reputational damage, almost always exceeds the cost of an extra LLM call.

There is no one-size-fits-all decomposition pattern. The right architecture depends on your task, your accuracy requirements, and your latency budget. But the principle is universal: decompose, specialise, and validate between steps.

Tier 3: Product and Organizational Decisions (Managing Real-World Impact)

A well-chosen model and a well-designed architecture will dramatically reduce hallucination. But they will not eliminate it entirely. Tier 3 is about what happens when hallucination inevitably slips through.

Product Specification and Scope

This is where I want to talk directly to PMs.

Most product managers do not think of themselves as having a role in hallucination. That is an engineering problem, right?

Wrong. PMs make at least seven decisions in every AI product spec that directly determine how often the product hallucinates and how much damage it causes. Most of these decisions are made implicitly, which is exactly the problem.

1. Defining the AI’s knowledge boundary

Most specs say “the AI should help users with X.” Almost none say “the AI should refuse to answer Y.”

That missing boundary is where hallucination lives. Without an explicit scope, the model attempts to answer everything, including questions it has no grounding for. A legal tool that says “I cannot advise on tax law, that is outside my scope” is a product decision that prevents an entire category of fabrication. Most PMs never write that line into the spec.

2. Choosing the AI’s autonomy level per task

Not every feature needs the same level of AI independence. The PM decides whether the AI:

  • Drafts (human reviews before it goes out)
  • Suggests (human picks from options)
  • Acts (AI executes without review)

Each level has a fundamentally different hallucination risk profile. “AI auto-generates and sends the customer report” is a different risk decision than “AI drafts the report for analyst review.” This is not an engineering call. This is a product call about acceptable risk.

3. Designing the fallback experience

What happens when the AI does not know?

Most specs do not answer this question. So the engineer makes a default choice, which is usually: let the model try anyway. The PM should define what the user sees when the AI is uncertain. Is there a graceful handoff to a human? An explicit “I do not have enough information” response?

The absence of a designed fallback path guarantees that hallucination reaches users unfiltered.

4. Deciding whether AI output is presented as authoritative or suggestive

“The answer is…” and “Based on available documents, it appears that…” are two completely different trust contracts with your users. The framing, the visual weight, the confidence of the language: these are product decisions. They determine whether a hallucination becomes a minor inconvenience or a trust-destroying event.

5. Specifying where human checkpoints exist

In multi-step workflows, the PM decides which steps require human review and which run autonomously. Every AI-to-AI handoff without a checkpoint is a product decision to accept compounding hallucination risk.

“The analysis agent feeds directly into the report generator which sends to the client” is a pipeline where no human ever validates the intermediate output. That is not an architecture choice. That is a product risk tolerance choice.

6. Defining metrics that actually catch hallucination

A ZDNET/Aberdeen survey found that many users are already tired of AI features being bolted onto products. If your success metrics are only adoption, task completion, and user satisfaction, you will miss hallucination entirely. A user can complete a task, feel satisfied, and have received a fabricated answer.

If the PM does not define hallucination rate, factual accuracy, or groundedness as tracked metrics alongside the usual KPIs, nobody will measure them. What does not get measured does not get fixed.

7. Owning the trust contract with users

Hallucination liability is real. Legal tools fabricating citations. Financial tools inventing numbers. Medical tools generating incorrect interactions.

The PM decides the product’s trust posture: is this a product that says “trust our AI” or “verify with our AI”?

That single framing decision cascades into every design, engineering, and support choice downstream.

UX and Presentation Layer

Take the exact same hallucinated output and present it two ways.

Version A: Same font, same styling, same layout as verified database content. No source attribution. No “AI generated” label. No confidence signal.

Version B: Subtle “AI generated” label. Inline citation links to source documents. Confidence indicator. A “verify this” affordance that lets users click through to the evidence.

The hallucination rate is identical. The harm is not even close.

Research from Nielsen Norman Group shows that confident presentation style from AI systems makes users significantly less likely to question incorrect answers. The design creates false authority.

Key design decisions that matter:

  • Visual distinction between AI-generated and verified content
  • Inline citations that let users trace claims to sources
  • Confidence signals (even subtle language shifts like “it appears” vs “it is”)
  • Verification affordances that make it easy to check, not just read
  • Feedback mechanisms that let users flag inaccurate outputs

This is not about adding disclaimers. It is about designing for the reality that your AI will sometimes be wrong, and helping users navigate that reality rather than hiding it.

Evaluation and Feedback Architecture

Most teams test hallucination during development, launch, and stop measuring. But hallucination rates drift. Knowledge bases go stale. User query patterns shift into unanticipated domains. The rate you measured at launch may not reflect what users experience six months later.

The architectural question is not “did we test for hallucination?” It is:

  • Do users have a way to flag inaccurate outputs?
  • Does that feedback reach someone who can act on it?
  • Does the knowledge base get refreshed on a defined cadence?
  • Are production outputs sampled for groundedness on an ongoing basis?

If the answer to most of these is no, you have a system that was tested once and is degrading silently.
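Ongoing groundedness sampling does not need heavy infrastructure to start. A sketch, assuming a review queue exists downstream of it; the 2% rate is an arbitrary placeholder, and `interactions` can be whatever record your product logs:

```python
import random

def sample_for_review(interactions, rate=0.02, seed=None):
    """Sample a slice of production interactions for groundedness review.

    Sampled items would feed a human or LLM-judge review queue; the
    default 2% rate is illustrative, not a recommendation.
    """
    rng = random.Random(seed)  # seedable for reproducible audits
    return [x for x in interactions if rng.random() < rate]
```

Even a tiny sampled slice, reviewed on a fixed cadence, turns "tested once at launch" into a trend line you can watch for drift.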

The Compounding Risk in Agentic Systems

One factor crosscuts all three tiers and deserves its own callout.

In agentic workflows where multiple AI steps feed into each other, a small hallucination in step two becomes the factual premise for step three. Step three builds on it. Step four synthesises from both. By the final output, the hallucination is deeply embedded and nearly impossible to detect.

No single step looks obviously wrong. The system as a whole produces confidently wrong results.

This compounding gets worse as agent chains get longer. The response comes from every tier:

  • Tier 1: Use the most capable model for high-stakes reasoning steps, cheaper models for simple execution
  • Tier 2: Validate between steps. Do not pass raw LLM output from one agent to the next without grounding
  • Tier 3: Design human checkpoints at critical junctions, not just at the final output

The Takeaway

Hallucination is not a single-layer problem. It is a system property that emerges from the interaction between your model, your architecture, and your product design.

The teams that build AI products with low hallucination rates work across all three tiers:

  • They choose the right model for their task and tune its parameters carefully (Tier 1)
  • They decompose complex tasks, constrain the model’s freedom, and design their retrieval pipeline with rigor (Tier 2)
  • They define clear boundaries in their product spec, give users tools to verify, and treat hallucination monitoring as an ongoing concern (Tier 3)

A great architecture on a decent model will outperform a poor architecture on the best model. But the best outcomes come from investing across all three tiers, not from over-indexing on any single one.

The next time your AI product hallucinates, do not start by swapping the model. And do not start by tuning the prompt. Start by asking: at which tier did the chain of decisions break down? The answer will point you to the highest-leverage fix.

Yes, Your AI Product Hallucinates. No, the Model Is Not the Only Reason. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
