RAG Architecture in Retail (Azure-based): A Real Enterprise Use Case

Retrieval-Augmented Generation—Retail Flavor

Retailers possess a wealth of information, from detailed product specifics and stock levels to pricing strategies and customer engagement history. Yet, much of this valuable data often goes untapped when it comes to refining customer experiences or guiding crucial internal decisions. Existing search functionalities and chatbot interfaces frequently deliver generic, or sometimes even outdated, information. They struggle to truly grasp what a user intends to find, and rarely integrate the most current business context. The direct consequence of this gap is often a frustrating product discovery journey for customers, inconsistent support interactions, and ultimately, missed chances to generate revenue.

Retail Data Pipeline

This is where Retrieval-Augmented Generation, or RAG, offers a compelling solution. It works by intelligently integrating advanced search capabilities with generative AI, allowing systems to pinpoint and retrieve highly relevant, current retail data, which then informs the creation of precise and contextually appropriate responses. Consequently, retailers gain the ability to offer significantly more personalized shopping journeys, enhance their operational efficiency, and establish a framework for making truly data-driven decisions across their entire operation.

Retrieval-Augmented Generation (RAG) grounds an LLM’s responses in your actual retail data — product catalogs, inventory, pricing, FAQs, order history — rather than relying on generic training knowledge. It’s the backbone of smart retail assistants, search, and recommendation engines.

Phase 1 — Data Ingestion

This is where most retail RAG projects fail silently. As architects, we need to define the data contract — what enters the pipeline and in what form.

Retail data sources are messy:

  • Product catalogs (CSV/JSON),
  • PDFs (return policies, brand guides),
  • Relational DBs (inventory, pricing),
  • Review streams (semi-structured text),
  • Promotional data that changes daily.

We designed connectors for each — batch pipelines for static content, streaming pipelines for live inventory and pricing. We also defined a clear schema at ingestion: every document must carry a source_type, last_updated, and a domain_entity tag (product, policy, FAQ) before anything else happens.
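As a rough illustration of that contract, here is a minimal sketch in Python; the field names follow the contract above, while the enum values and validation rules are assumptions for illustration, not the production schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

@dataclass
class IngestedDocument:
    doc_id: str
    source_type: Literal["catalog", "policy_pdf", "relational", "review", "promotion"]
    domain_entity: Literal["product", "policy", "faq"]
    last_updated: datetime
    content: str
    extra: dict = field(default_factory=dict)  # source-specific attributes (SKU, category, ...)

def validate(doc: IngestedDocument) -> None:
    # Reject anything that violates the contract before it reaches chunking.
    if not doc.content.strip():
        raise ValueError(f"{doc.doc_id}: empty content")
    if doc.last_updated > datetime.now(timezone.utc):
        raise ValueError(f"{doc.doc_id}: last_updated is in the future")
```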

Pipeline at a Glance

The ingestion phase has four sequential layers before data reaches the RAG chunking stage:

STEP 1 — RETAIL DATA SOURCES

Every retail org has the same five categories of knowledge. Each has a different format, owner, and update cadence.

STEP 2 — INGESTION MODES

  • Batch Ingestion (Azure Data Factory) Used for: Product catalog, policies, FAQs, brand content.
  • Streaming Ingestion (Azure Event Hubs + Stream Analytics) Used for: Inventory levels, live pricing, orders.
  • Event-Triggered Ingestion (Logic Apps + Functions) Used for: Flash sales, promotional content, emergency policy updates.

STEP 3 — VALIDATION, DEDUP & ERROR HANDLING

  • Schema validation (Azure Functions) — Data contract document signed off — every source mapped to source_type, format, frequency, owner
  • Deduplication (Azure Cosmos DB) — re-ingest same file, verify zero duplicates in ADLS
  • Staleness / TTL enforcement — TTL per source type documented and enforced in Cosmos DB registry

Observability & Monitoring in the Ingestion Pipeline

Ingestion observability is non-negotiable in production. We wire the key ingestion metrics into Azure Monitor from day one.

Phase 2 — Chunking & Enrichment

This is the most impactful design decision. Chunk too large and retrieval is imprecise; chunk too small and you lose context.

For retail, product descriptions need different chunking than policy documents. A product chunk should ideally contain: name, SKU, category path, key attributes, and price — all in one chunk so retrieval always returns a complete, actionable unit. Policy documents can use semantic chunking (split at paragraph boundaries, preserve section headers as metadata).

Apply metadata enrichment aggressively: add brand, category_l1/l2/l3, price_band, in_stock as filterable fields. This is what enables hybrid filtering later.
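A minimal sketch of that enrichment step, assuming product records already carry price, brand, category path, and stock level; the price-band boundaries are illustrative, not values from this project.

```python
def enrich_chunk(chunk_text: str, product: dict) -> dict:
    """Attach filterable metadata to a product chunk before indexing."""
    price = product["price"]
    price_band = "budget" if price < 50 else "mid" if price < 200 else "premium"  # illustrative bands
    return {
        "content": chunk_text,
        "brand": product["brand"],
        "category_l1": product["category_path"][0],
        "category_l2": product["category_path"][1],
        "category_l3": product["category_path"][2],
        "price_band": price_band,
        "in_stock": product["stock_level"] > 0,
    }
```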

Invest in a ParentDocumentRetriever pattern — store small chunks for precise retrieval, but return the larger parent document to the LLM for richer context.
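One way to realise this pattern is LangChain's ParentDocumentRetriever. The sketch below is a minimal illustration that assumes Chroma and OpenAI embeddings as stand-ins (not the exact Azure stack used elsewhere in this architecture); load_policies() stands in for whatever produced the ingested documents, and the splitter sizes are character counts used as a rough proxy for the token guidance given later.

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = [Document(page_content=text, metadata={"source": "return_policy.pdf"})
        for text in load_policies()]  # load_policies() is a hypothetical ingestion helper

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="child_chunks", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),                                         # parents live here
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=250),    # small, precise chunks
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1000),  # roughly full sections
)
retriever.add_documents(docs)  # indexes children, stores parents
hits = retriever.invoke("What is the return window for electronics?")  # returns parent documents
```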

Chunking is the single most impactful decision in a RAG pipeline: a bad chunking strategy will defeat a great embedding model and a great retrieval algorithm, so the strategy has to be chosen deliberately.

Document Processing & Chunking Strategies

1. Fixed Size / Sliding Window

Best for: General documents, web pages, articles
Split every N tokens with an overlap window. Simple, fast, deterministic. Overlap prevents important context from being split across chunk boundaries.
Parameters to tune:
• chunk_size: 400–600 tokens (sweet spot)
• chunk_overlap: 10–15% of chunk_size
• separator: paragraph > sentence > word

When to use: Default strategy. Start here and upgrade only when evaluation scores show it’s insufficient.

Retail Use: Supplier contracts have tables, numbered clauses, and cross-references that fixed splits destroy, so this is not a good fit for our case.

2. Semantic / Embedding-based

Best for: Mixed-topic documents, transcripts, long-form content
Split based on semantic similarity between sentences — a new chunk starts when the topic changes. Uses embeddings to detect topic boundaries.
Parameters to tune:
• breakpoint_threshold_type: percentile or standard_deviation
• breakpoint_threshold_amount: 95th percentile typical
• buffer_size: 1–3 sentences around boundary
Trade-off: Higher quality chunks, but requires embedding model at ingestion time — slower and more expensive than fixed-size.

Retail Use: If RAGAS scores are insufficient with the recursive splitter, upgrade to semantic chunking and test the cost against the quality improvement. We can use this, but with trade-offs.

3. Hierarchical (Parent-Child)

Best for: Structured documents — reports, specs, legal contracts, manuals
Store large parent chunks for context but retrieve small child chunks for precision. When a child chunk matches, fetch its parent for full context delivery to the LLM.
Best practice:
• Parent: 1024–2048 tokens (full section)
• Child: 128–256 tokens (specific fact)
• Index children, store parents separately
Result: High retrieval precision (small chunks match better) + high generation quality (LLM gets full parent context).

Retail Use: Supplier contracts have a natural section → clause → sub-clause hierarchy. Retrieve the specific clause, generate from the full section. Best for our case.

4. Document-Structure Aware

Best for: Legal documents, contracts, technical manuals, spreadsheets with known structure
Parse document structure (headers, tables, lists, sections) and chunk along structural boundaries rather than character counts.
Preserve structure of:
• Tables → keep as single chunk or CSV format
• Numbered lists → keep as unit
• Code blocks → never split mid-code
• Markdown headers → use as chunk boundaries
Tools: Unstructured.io, PyMuPDF, MarkItDown (Microsoft)

Retail Use: Good for the pricing tier tables within contracts. Too granular for general contract text.

The Overlap Parameter — Why It Matters

We used RecursiveCharacterTextSplitter with 500 tokens and 60-token overlap as the baseline (Section 4.2 of a supplier contract = ~380 tokens, fitting cleanly in one chunk). We add a pre-processing step using Azure Document Intelligence to extract tables as structured JSON before chunking — tables split poorly with text chunkers and pricing tiers are the most-queried content. Tables become structured chunks with preserved column headers.
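A minimal sketch of that baseline, assuming LangChain's RecursiveCharacterTextSplitter with a tiktoken-based length function so the sizes are counted in tokens rather than characters:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,                        # tokens per chunk
    chunk_overlap=60,                      # ~12% overlap so clauses aren't cut at boundaries
    separators=["\n\n", "\n", ". ", " "],  # paragraph > sentence > word
)

# Table-free contract text produced upstream by the Document Intelligence step.
contract_text = open("supplier_contract.txt").read()
chunks = splitter.split_text(contract_text)
```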

Metadata — The Hidden Multiplier

Every chunk stored in the vector index should carry metadata fields. These are not used for embedding similarity — they’re used as pre-filters on the search, reducing the candidate pool before vector comparison even begins. Good metadata design can improve retrieval precision by 30–40% for filtered queries.
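As a sketch of how those metadata fields become pre-filters, here is a hybrid query against Azure AI Search; the endpoint, index name, field names, and filter values are assumptions for illustration.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import OpenAI

query = "waterproof trail running shoes"
query_embedding = OpenAI().embeddings.create(
    model="text-embedding-3-small", input=query
).data[0].embedding

client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="retail-chunks",
    credential=AzureKeyCredential("<api-key>"),
)
results = client.search(
    search_text=query,                                       # sparse/keyword side
    vector_queries=[VectorizedQuery(vector=query_embedding,  # dense side
                                    k_nearest_neighbors=50,
                                    fields="content_vector")],
    filter="category_l1 eq 'footwear' and in_stock eq true and price_band eq 'mid'",  # pre-filter
    top=20,
)
```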

Phase 3 — Embedding & Indexing

Embeddings

An embedding model converts text into a point in high-dimensional space. Similar meaning = nearby points. Understanding the geometry of this space — dimensions, distance metrics, model architecture — is what separates an engineer who uses RAG from one who designs it.

How Embeddings Work

A transformer encodes text into a fixed-length vector. Semantically similar texts produce similar vectors (small cosine distance). The model is trained on massive datasets to capture meaning, not just keyword overlap.

Why Embedding Quality Matters

“What is our refund policy?” and “How do I get my money back?” must map to the same region of vector space. A weak model treats them as different. A strong model (text-embedding-3-large) understands they’re semantically equivalent. This directly determines retrieval recall.

The Three Similarity Metrics
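The heading above does not enumerate the metrics, so this sketch assumes the three used almost universally for embedding search: cosine similarity, dot product, and Euclidean distance.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # higher = more similar

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))  # equals cosine similarity when vectors are L2-normalised

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))  # lower = more similar, unlike the two above
```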

The Dimensionality Decision: Embedding Model Comparison

text-embedding-3-large generates 3,072-dimensional vectors.

Each dimension is a 32-bit float = 4 bytes. A single vector = 12,288 bytes ≈ 12KB. For 34,000 chunks: 34,000 × 12KB = ~408MB of raw vector storage. That’s manageable. But what if you have 10 million chunks? 120GB of vectors is expensive and slow to search.

1. 256 Dimensions

8.3% of full. ANN candidate retrieval at 12× less storage. <5% quality loss with MRL (Matryoshka Representation Learning). Our choice for the ANN pass.

2. 512 Dimensions

Good balance. 6× storage saving. ~2–3% quality loss. Alternative if 256 is too aggressive for your domain vocabulary.

3. 1,536 Dimensions

text-embedding-3-small native. 2× storage saving from full. Use when cost matters more than top quality.

4. 3,072 Dimensions

Full text-embedding-3-large. Maximum quality. Used for reranking pass only in our two-stage retrieval.
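A minimal sketch of requesting both sizes from the embeddings API; the dimensions parameter performs the truncation server-side. It is shown with the OpenAI client for brevity (the Azure OpenAI client accepts the same parameters), and the query text is illustrative.

```python
from openai import OpenAI

client = OpenAI()
text = "budget trail running shoes"

ann_vector = client.embeddings.create(
    model="text-embedding-3-large", input=text, dimensions=256
).data[0].embedding  # 256-d vector for the cheap ANN pass

rerank_vector = client.embeddings.create(
    model="text-embedding-3-large", input=text
).data[0].embedding  # full 3,072-d vector for the reranking pass
```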

We’re not picking just one index — retail RAG needs at least two working in parallel.

A dense vector index (Pinecone, Weaviate, pgvector) handles semantic similarity — “affordable running shoes” finds “budget athletic footwear” even with zero keyword overlap.

A sparse/keyword index (BM25 via Elasticsearch or OpenSearch) handles exact matches — SKU codes, brand names, model numbers. Neither alone is sufficient. Build your ingestion pipeline to write to both simultaneously. For category-aware retrieval, consider a lightweight graph structure representing your product taxonomy — it helps navigate “show me similar products in this category” queries.

Decision: Use text-embedding-3-small or a fine-tuned retail model (fine-tune on your product-query pairs if you have search logs). Embedding quality is our ceiling — no retrieval trick overcomes a bad embedding model.

Phase 4 — Retrieval & Reranking

Retrieval — Finding the Right Chunks

Retrieval is not just “vector search”. A production retrieval pipeline has four stages — each one adding precision that the previous stage couldn’t achieve alone. Understanding why each stage exists is what separates a demo RAG system from a production one.

This is the runtime critical path — it runs on every user query, so latency matters.

The flow is: query rewriting → hybrid search → reranking → metadata filtering.

Query rewriting uses a small LLM call to expand the user’s intent (“do you have red Nikes in size 10?” → extract: category=footwear, brand=Nike, color=red, size=10).
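A minimal sketch of that rewriting step, assuming an OpenAI-compatible chat model in JSON mode; the attribute schema and prompt wording are illustrative, not the production prompt.

```python
import json
from openai import OpenAI

client = OpenAI()

def rewrite_query(user_query: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract retail search intent as JSON with keys: "
                        "search_text, category, brand, color, size. Use null when absent."},
            {"role": "user", "content": user_query},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# rewrite_query("do you have red Nikes in size 10?")
# -> {"search_text": "red Nike shoes", "category": "footwear",
#     "brand": "Nike", "color": "red", "size": "10"}   (illustrative output)
```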

Hybrid search runs dense + sparse in parallel and fuses scores (Reciprocal Rank Fusion works well). A cross-encoder reranker (like Cohere Rerank or a local model) then re-scores the top-20 candidates by actual semantic relevance, not just vector proximity.

Finally, apply metadata filters as hard constraints: filter out-of-stock items, enforce price ranges, restrict to the user’s geography.

BM25 (Best Match 25)

BM25 is a probabilistic ranking algorithm — the foundation of traditional search engines. It scores each document chunk based on: how many query terms appear in the chunk (term frequency), adjusted for how rare those terms are across the whole corpus (inverse document frequency), and normalised for chunk length.
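A minimal sketch of BM25 scoring using the rank_bm25 package, chosen purely for illustration (in production, the BM25 ranking built into Elasticsearch, OpenSearch, or Azure AI Search does this for you); the corpus and query are invented.

```python
from rank_bm25 import BM25Okapi

corpus = [
    "SKU 88412 Nike Air Zoom Pegasus 40 road running shoe",
    "Return policy: unworn items may be returned within 30 days",
    "SKU 90311 Adidas Ultraboost Light running shoe",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])  # term frequencies + IDF per term
scores = bm25.get_scores("sku 88412".lower().split())      # exact-term query: chunk 0 wins
```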

Hybrid Search — Dense + Sparse

Dense search (embeddings) excels at semantic similarity but misses exact keyword matches. Sparse search (BM25/TF-IDF) excels at exact terms but misses paraphrases. Hybrid combines both via Reciprocal Rank Fusion (RRF).

Combine dense (semantic) and sparse (keyword/BM25) retrieval, using reciprocal rank fusion (RRF) to merge results. This improves recall for queries with rare terms or proper nouns.
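A minimal sketch of RRF: each retriever contributes 1/(k + rank) per document, and the fused list is sorted by the summed score. The constant k = 60 is the commonly used default, an assumption here rather than a tuned value.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs (e.g., dense and BM25 results) into one ranking."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# candidates = reciprocal_rank_fusion([dense_doc_ids, bm25_doc_ids])[:50]
```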

Reranking:

Apply cross-encoder rerankers (e.g., Cohere Rerank, BGE-reranker) to top-N candidates for fine-grained relevance scoring.

The Reranking Stage — Why Not Just Take Top-10?

The hybrid search + RRF gives us a ranked list of ~50 candidates. Giving all 50 to GPT-4o would be expensive (50 × 500 tokens = 25,000 tokens of context per query) and would trigger the “lost in the middle” problem — GPT-4o attention degrades on chunks in the middle of a long context window.

The cross-encoder reranker solves this. Where the bi-encoder (embedding model) encodes query and document separately and then compares them, the cross-encoder encodes query and document together, allowing full cross-attention between them. This is much more accurate — but too expensive to run on millions of candidates. We use it only on the 50 shortlisted candidates from hybrid search.

Azure AI Search includes Microsoft’s semantic reranker (based on a fine-tuned language model) as a managed service — no self-hosting required. Pass it the hybrid search results; it returns a reranked list with a semantic relevance score per chunk.
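A minimal sketch of invoking that managed reranker, continuing the SearchClient and VectorizedQuery setup from the metadata-filtering sketch in Phase 2; the semantic configuration name and result fields are assumptions, so check them against your index definition and SDK version.

```python
results = client.search(
    search_text="what is the late-delivery penalty in the Contoso contract?",
    vector_queries=[VectorizedQuery(vector=query_embedding,
                                    k_nearest_neighbors=50,
                                    fields="content_vector")],
    query_type="semantic",
    semantic_configuration_name="retail-semantic-config",  # defined on the index
    top=10,
)
for doc in results:
    print(doc["chunk_id"], doc["@search.reranker_score"])  # reranked relevance per chunk
```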

Query Transformation: Use LLMs to generate query variants or hypothetical answers (HyDE) to improve retrieval for ambiguous or domain-specific queries.
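A minimal sketch of HyDE: generate a short hypothetical answer, embed it, and search with that embedding instead of the raw query. Model names here are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def hyde_embedding(user_query: str) -> list[float]:
    hypothetical = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short, plausible answer to: {user_query}"}],
    ).choices[0].message.content
    # Embed the hypothetical answer: it usually lands closer to the relevant chunks
    # than the original question does.
    return client.embeddings.create(
        model="text-embedding-3-small", input=hypothetical
    ).data[0].embedding
```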

Phase 5 — Generation & Response

The LLM only sees what you put in its context window. Your prompt engineering is the product.

Build a structured prompt that includes: a system persona (retail assistant role), retrieved context chunks with source labels, the user query, and explicit instructions to cite sources and say “I don’t have that information” when context is absent.
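A minimal sketch of assembling that prompt; the persona wording and chunk field names are illustrative.

```python
SYSTEM_PROMPT = (
    "You are a retail assistant. Answer ONLY from the provided context. "
    "Cite sources as [SOURCE: ...]. If the answer is not in the context, "
    "say \"I don't have that information.\""
)

def build_messages(context_chunks: list[dict], user_query: str) -> list[dict]:
    context = "\n\n".join(
        f"[SOURCE: {c['source']}, updated {c['last_updated']}]\n{c['content']}"
        for c in context_chunks
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"},
    ]
```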

Add guardrails — a lightweight classifier that checks whether the generated answer makes claims not supported by the retrieved context (hallucination detection). Inject citations back into the response so users can verify price or policy claims. For structured queries (product comparisons, price checks), consider routing to a structured output mode rather than free-form text.

Context Assembly Architecture

1. Deduplicate Retrieved Chunks

Remove near-duplicate chunks by cosine similarity (>0.98 = duplicate). Keeps context window focused on diverse information.
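A minimal sketch of that deduplication pass, assuming each retrieved chunk still carries its embedding; the 0.98 threshold follows the text above.

```python
import numpy as np

def deduplicate(chunks: list[dict], threshold: float = 0.98) -> list[dict]:
    """Drop chunks whose embedding is nearly identical to an already-kept chunk."""
    kept: list[dict] = []
    for chunk in chunks:  # chunks arrive in relevance order
        v = np.asarray(chunk["embedding"], dtype=float)
        v /= np.linalg.norm(v)
        is_duplicate = any(
            float(np.dot(v, np.asarray(k["embedding"], dtype=float)
                         / np.linalg.norm(k["embedding"]))) > threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(chunk)
    return kept
```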

2. Format Citations

Prepend each chunk with structured citation markers: [SOURCE: filename, page X]. LLM is instructed to reference these in its answer.

3. Respect Token Budget

System prompt + context + user query + expected response must fit in context window. GPT-4o: 128K tokens. Typical budget: 6K for context, 1K for response.

4. Lost-in-the-Middle Mitigation

LLMs attend better to beginning and end of context. Place highest-scoring chunks first AND last. Least relevant in the middle.
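A minimal sketch of that placement: alternate the ranked chunks between the front and the back of the context so the weakest ones land in the middle.

```python
def reorder_for_attention(chunks_by_score: list[dict]) -> list[dict]:
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):  # best chunk first
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]                    # best ... weakest ... second-best
```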

5. Faithfulness Enforcement

System prompt explicitly instructs: “Answer ONLY from the provided context. If the answer is not in the context, say so.”

Evaluation — RAGAS & Beyond

RAGAS (Retrieval Augmented Generation Assessment) is the industry-standard framework for evaluating RAG pipelines. It measures both retrieval quality and generation quality with reference-free metrics.

1. Faithfulness ≥ 0.90

Measures if the generated answer is grounded in the retrieved context. Are claims supported by source documents? The anti-hallucination metric.

Root Cause Location: Generation layer — LLM not grounded

2. Answer Relevance ≥ 0.85

Measures if the answer actually addresses the question asked. A perfectly faithful answer can still be off-topic. This catches that.

Root Cause Location: Generation layer — verbosity

3. Context Precision ≥ 0.80

Measures if the retrieved chunks are actually relevant. Are you retrieving noise? Lower precision = LLM gets distracted by irrelevant context.

Root Cause Location: Retrieval layer — wrong chunks retrieved

4. Context Recall ≥ 0.75

Measures if ALL relevant information was retrieved. Are you missing important chunks? Requires ground truth annotations — most expensive to compute.

Root Cause Location: Retrieval layer — relevant chunks not retrieved

5. Context Entity Recall ≥ 0.70

Did the retrieved context include the key entities (people, places, concepts) needed to answer the question? Entity-focused recall metric.

6. E2E Latency ≤ 2s (P95)

Total time from user query to complete response. P95 target: under 2 seconds for enterprise chat applications. Critical UX metric.

Building Your Evaluation Dataset

RAGAS needs a dataset of questions + ground truth answers to evaluate against. For 340 supplier contracts, manually writing 200 Q&A pairs is expensive. We use a synthetic eval generation approach: GPT-4o reads each contract section and generates plausible questions a procurement specialist might ask, along with the ground truth answer extracted from the text. The generated pairs are reviewed by one procurement specialist before being added to the eval set — human review, not human creation.
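A minimal sketch of running the eval set through RAGAS, assuming its 0.1-style Python API and an OpenAI key for the judge model; the sample row is invented for illustration.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_ds = Dataset.from_dict({
    "question":     ["What is the late-delivery penalty in the Contoso contract?"],
    "answer":       ["The penalty is 2% of order value per week of delay."],         # pipeline output
    "contexts":     [["Section 4.2: Late delivery incurs a 2% penalty per week."]],  # retrieved chunks
    "ground_truth": ["2% of the order value per week of delay."],                    # reviewed eval answer
})

scores = evaluate(eval_ds, metrics=[faithfulness, answer_relevancy,
                                    context_precision, context_recall])
print(scores)  # compare against the thresholds above: faithfulness >= 0.90, answer relevance >= 0.85, ...
```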

End-to-End Evaluation Strategies and the RAG Triad

The RAG Triad: Retrieval, Answer, Faithfulness

End-to-end evaluation must capture the interplay between retrieval, generation, and grounding. The RAG triad organizes this as:

  1. Retrieval Relevance: Did the retriever surface the right evidence?
  2. Answer Relevance: Did the answer address the user’s query?
  3. Faithfulness: Was the answer strictly grounded in the retrieved context?

Evaluation Frameworks:

  • RAGAS: Provides reference-free metrics for context recall, faithfulness, answer relevance, and context precision. Integrates with LangChain, LlamaIndex, and custom pipelines.
  • ARES: Focuses on retrieval evaluation with synthetic queries and LLM judges.
  • LLM-as-Judge: Uses a secondary LLM (e.g., GPT-4) to grade answers for correctness and support from context.
  • Human Review: Structured sampling and review loops to validate automated metrics and catch edge cases.

Core Retrieval Metrics

Retrieval quality is the linchpin of RAG performance. Key metrics include context precision, context recall, and the ranking quality of the retrieved chunks.

Generation Quality Metrics

The generation phase must produce answers that are not only fluent but also faithful (grounded in retrieved context), relevant (addressing the query), and format-compliant (matching output contracts). Key metrics include faithfulness, answer relevance, and format compliance.

Observability Layer

Full tracing, evaluation, and cost monitoring

  • Azure Monitor + App Insights — infrastructure
  • LangSmith — AI chain tracing
  • RAGAS — automated quality evaluation
  • Azure Cost Management — token cost tracking
  • Custom SIEM alerts — security events

Production Observability

RAG pipelines require deep observability beyond traditional application monitoring. Key observability pillars include:

  • Logs: Capture all pipeline activity, including queries, retrievals, prompts, outputs, and errors.
  • Metrics: Track retrieval latency, recall/precision, generation faithfulness, hallucination rates, and system health.
  • Traces: Instrument each pipeline stage with trace IDs, enabling root cause analysis and auditability.
  • Drift and Quality Signals: Monitor for retrieval drift, output drift, and quality degradation over time.

Service Level Objectives (SLOs):

  • Define SLOs for end-to-end latency (e.g., <1s for 95% of queries), retrieval accuracy, faithfulness, and uptime.
  • Set up real-time alerting for SLO breaches, metric anomalies, or drift events.

Tooling:

  • Use platforms like Galileo, Arize AI, LangSmith, or Langfuse for integrated observability and evaluation.

The Production Readiness Checklist

Index Quality Gates

1. ✓ RAGAS eval ≥ all thresholds before first deploy

Run the full 200-question eval set 3× and assert mean + min + std_dev. No exceptions. The first deploy sets your baseline.

2. ✓ All 340 supplier contracts indexed and verified

Chunk count per supplier logged. Spot-check 10 random suppliers: manually verify that a known clause is retrievable.

3. ✓ Supplier ID filter tested for isolation

Verify that a query for supplier A returns zero results from supplier B’s contracts. Test with 5 supplier pairs.

Operational Gates

4. ✓ Delta index pipeline running and tested

Simulate a contract update: modify one clause, verify that old chunk is removed and new chunk is retrievable within the scheduled pipeline window.

5. ✓ Langfuse tracing all queries with faithfulness score

Every query logged: query text, retrieved chunk IDs and scores, generated answer, faithfulness score, latency. Zero blind spots.

6. ✓ Drift alert configured — weekly rolling average

Azure Monitor alert fires if 7-day rolling RAGAS faithfulness drops >5% vs prior week. Auto-creates an incident ticket.

Conclusion

RAG in retail isn’t just about plugging AI into existing systems — it’s about creating smarter, more responsive experiences for both customers and businesses. By blending retrieval with generation, retailers can move beyond static dashboards to deliver insights, recommendations, and support that feel timely and personal. The real advantage lies in helping finance teams make faster decisions, guiding customers with relevant answers, and enabling operations to adapt in real time.

