Multimodal RAG: Architecture, Tradeoffs, and What Actually Works in Production

This article assumes you already know what RAG is, why naive RAG breaks at scale, and what chunking, embedding, and retrieval mean. We skip the basics.

The Problem with Text-Only RAG at Scale

Standard RAG pipelines assume your knowledge base is text. That assumption holds until you’re working with enterprise data — and enterprise data is almost never purely text. It’s scanned PDFs with tables, slide decks with charts, audio meeting recordings, product images, and structured CSVs that live alongside unstructured docs. Running text-only RAG over this corpus means you’re systematically throwing away a massive chunk of information — anything that doesn’t survive an OCR pipeline or text extraction step.

The failure modes are specific:

  • A table in a PDF gets extracted as garbled text, losing the row-column structure entirely
  • A chart becomes a caption-less image that your text embedder never sees
  • A voice recording gets transcribed through ASR, losing speaker tone, emphasis, and non-speech audio
  • A scanned invoice with a logo, stamp, and handwriting is largely invisible to your retriever

Multimodal RAG doesn’t solve all of this — but it directly addresses the retrieval and grounding problems that arise when your data isn’t text-clean.

The Three Architectural Patterns

Before getting into libraries and code, let’s establish what you’re actually choosing between when you design a multimodal pipeline. There are three retrieval/fusion patterns, and they have meaningfully different tradeoffs.

Pattern 1: Extract-then-Embed (Late Fusion / Separate Indexes)

This is the most widely deployed pattern today. The idea is simple: process each modality through a specialized extractor, convert everything to text or text-adjacent representations, then embed using standard text embeddings.

Pipeline:

PDF → PyMuPDF / Unstructured → text + image captions (via GPT-4V or LLaVA)
Audio → Whisper → transcript text
Images → BLIP-2 / LLaVA → generated caption text
All outputs → text embedder (e.g. OpenAI ada-002, BGE-M3) → vector store

What works: Operationally simple. Reuses your existing text RAG infrastructure. Any text-capable LLM becomes your generator — no multimodal LLM required for generation, only for the captioning step.

What breaks: You lose information at every extraction step. A bar chart captioned as “a bar chart showing revenue trends” is not the same as the chart. Table structure collapses into linearized text. Whisper is excellent but still a lossy transcription — paralinguistic information is gone. Most critically, captioning via GPT-4V during ingestion is expensive and slow at scale.

When to use it: You have a text-dominant corpus with some embedded images/tables. Your budget doesn’t support native multimodal retrieval infrastructure. You need to ship quickly and your retrieval quality requirements are moderate.

Pattern 2: Native Multimodal Embedding (Early Fusion / Shared Vector Space)

Instead of converting everything to text, you embed each modality directly into a shared vector space using a cross-modal encoder. CLIP is the canonical example: it maps text and images into the same embedding space (512-dimensional for the ViT-B/32 variant), enabling text queries to retrieve images directly without captioning.

Pipeline:

Images → CLIP image encoder → image vectors → vector store (index A)
Text/docs → text encoder (ada-002 or BGE) → text vectors → vector store (index B)
Query → CLIP text encoder → retrieve from both indexes → rerank → multimodal LLM

Code setup with LlamaIndex:

from llama_index.core import StorageContext
from llama_index.core.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.clip import ClipEmbedding
import qdrant_client

# Two separate stores: one for text, one for images
client = qdrant_client.QdrantClient(host="localhost", port=6333)
text_store = QdrantVectorStore(client=client, collection_name="text_docs")
image_store = QdrantVectorStore(client=client, collection_name="image_docs")
storage_context = StorageContext.from_defaults(
    vector_store=text_store,
    image_store=image_store,
)

# `documents` is assumed to be loaded already (e.g. via SimpleDirectoryReader)
# CLIP embeds images; the text embedder handles text chunks
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    image_embed_model=ClipEmbedding(),
)

Two separate Qdrant collections — one for text chunks (1536-dim with ada-002), one for image embeddings (512-dim with CLIP). At query time, two similarity searches run in parallel and results are merged before generation.
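
Because the two collections use different embedding models, their raw similarity scores are not directly comparable, so the merge step needs some care. One common option is reciprocal rank fusion (RRF), which uses only rank positions. The sketch below is an assumption about how the merge could be done, not a description of what LlamaIndex does internally; the document IDs and the constant k=60 are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs using reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Only rank positions are used, so the lists may
    come from embedding spaces with incomparable similarity scales.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical IDs returned by the text and image collections for one query.
text_hits = ["text_12", "text_07", "text_31"]
image_hits = ["img_04", "img_19"]

merged = reciprocal_rank_fusion([text_hits, image_hits])
print(merged)
```

A document that appears in both lists accumulates score from each, so it rises above documents that appear in only one.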

What works: No captioning step means faster ingestion. Cross-modal retrieval is native: a text query can pull a relevant chart directly. ImageBind (from Meta) applies the same shared-space idea to 6 modalities: text, image, audio, depth, thermal, and IMU, giving a single embedding space for everything.

What breaks: CLIP’s 512-dimensional embedding is relatively low-capacity. Complex visual reasoning, table understanding, and document layout are not captured well in a single vector. The cross-modal alignment is trained on natural image-text pairs — it degrades on technical diagrams, scientific plots, and domain-specific charts unless fine-tuned.

When to use it: Image-heavy corpus (product catalogs, medical scans, scientific literature with figures). You want cross-modal queries (find images using text, find similar images). Audio + image + text in one retrieval step is the use case for ImageBind.

Pattern 3: ColPali / Late Interaction — The Current Best for Document-Heavy Workloads

This is the most important architectural development in multimodal retrieval over the past year, and it’s still underused in production.

Standard bi-encoder retrieval (CLIP, ada-002) compresses a document page into a single vector. That’s a severe bottleneck — you can’t fit all the semantic content of a complex PDF page into one 512 or 1536-dimensional vector. ColPali takes a different approach: it encodes each image patch of a document page into its own vector, producing a multi-vector representation per page. At query time, similarity is computed between query token vectors and all page patch vectors using a MaxSim late interaction mechanism (borrowed from ColBERT).

The critical difference from CLIP: instead of one vector per page, you get one vector per image patch. At ColPali's 448×448 input resolution the page is split into a 32×32 grid, so each page produces roughly 1,000 patch vectors. For each query token, scoring takes the maximum similarity over all page patches and then sums those maxima across tokens, so matching happens at the patch level rather than the page level.

Architecture:

PDF page → screenshot (pdf2image) → PaliGemma-3B VLM
→ patch grid embeddings [n_patches × 128-dim]
→ stored in multi-vector index (PLAID / Qdrant with multi-vector support)
Query → text encoder → token embeddings [n_tokens × 128-dim]
→ MaxSim scoring against all page patch vectors
→ top-k pages retrieved → fed to multimodal LLM for answer generation
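
The MaxSim step in the architecture above can be written out in a few lines. This is a minimal pure-Python sketch with toy 2-dimensional vectors (real ColPali vectors are 128-dimensional and scored in batches on a GPU); the page IDs and vector values are made up for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query token vector, take its best
    match among the page's patch vectors, then sum over query tokens."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return page IDs sorted by MaxSim score, best first."""
    scored = {pid: maxsim_score(query_vecs, vecs) for pid, vecs in pages.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Two query tokens, each pointing along a different axis.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.9, 0.1], [0.1, 0.9]],  # one patch matches each token
    "page_b": [[0.9, 0.1], [0.8, 0.2]],  # both patches match only token 1
}
ranking = rank_pages(query, pages)
print(ranking)
```

page_a wins because every query token finds a strong patch somewhere on the page, whereas a single pooled vector would blur that distinction.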

Code with Byaldi (the practical wrapper):

from byaldi import RAGMultiModalModel
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Index: no OCR, no chunking, no captioning
retriever = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.3")
retriever.index(
    input_path="./documents/",
    index_name="contracts_idx",
    store_collection_with_index=True,  # keep base64 page images with the index
    overwrite=True,
)

# Retrieve
results = retriever.search("termination clause with penalty", k=3)
# Each result carries doc_id, page_num, and score; with
# store_collection_with_index=True it also carries the page as a base64 image

# Generate: decode the retrieved pages and feed them directly to a VLM
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

No OCR. No chunking pipeline. No captioning step. You screenshot the PDF pages, embed them, and retrieve. The VLM sees the actual rendered page — including table borders, font sizes, column layout, embedded figures — and reasons over it.

What works: Dramatically outperforms OCR-based pipelines on documents with complex layouts, such as legal contracts, medical forms, and financial reports with embedded charts. The ViDoRe benchmark shows ColPali outperforming prior SOTA retrieval systems across most document domains. The ingestion pipeline is also much simpler: the slowest stages of an OCR-based pipeline (layout detection, OCR, chunking, captioning) are eliminated entirely.

What breaks: Multi-vector storage is expensive. Each page generates ~1000 128-dim vectors (about 128,000 values) vs. a single 1536-dim vector in bi-encoder setups, which is roughly 80× more raw storage at the same precision. At millions of documents, you need PLAID-style compressed indices or quantization. Query latency is also higher, though for corpus sizes under a few hundred thousand pages the overhead is on the order of milliseconds.
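
The storage gap can be estimated from the figures in this section: about 1,000 patch vectors of 128 dimensions per page versus one 1536-dimensional vector. The sketch assumes uncompressed float32 values and ignores index overhead; the one-million-page corpus is an illustrative number, not a measurement.

```python
def bytes_per_page(n_vectors, dim, bytes_per_value=4):
    """Raw storage for one page's embeddings (float32 by default)."""
    return n_vectors * dim * bytes_per_value


multi_vector = bytes_per_page(n_vectors=1000, dim=128)  # ColPali-style page
single_vector = bytes_per_page(n_vectors=1, dim=1536)   # one dense vector

ratio = multi_vector / single_vector
pages = 1_000_000
total_gb = multi_vector * pages / 1e9

print(f"multi-vector:  {multi_vector / 1024:.0f} KiB per page")
print(f"single-vector: {single_vector / 1024:.0f} KiB per page")
print(f"ratio: {ratio:.0f}x, {total_gb:.0f} GB for {pages:,} pages")
```

Passing bytes_per_value=1 (for example, int8 quantization) cuts the multi-vector footprint by 4×, which is why compression is usually the first lever to pull.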

When to use it: Document-heavy enterprise use cases — legal, finance, medical. Any corpus where layout, tables, charts, and typography carry semantic meaning. Not a good fit for retrieval over natural images (CLIP is better there) or pure text corpora (standard text RAG beats it).

Audio — The Modality Nobody Gets Right

There are two approaches to audio in a RAG pipeline.

Transcription-first (ASR → Text RAG)

Audio → Whisper large-v3 → transcript → chunk + embed as text

This is the default and it works adequately for speech-dominant audio (meetings, interviews, podcasts). Whisper large-v3 is the current open-source ceiling — word error rates under 5% for clean English audio.

The problem is what gets thrown away: speaker identity, emotional tone, emphasis, background audio events, music. If your use case requires those, transcription pipelines are not the answer.

Native Audio Retrieval (WavRAG-style)

A 2025 ACL paper (WavRAG) introduced dual-modality encoding — embedding audio waveforms directly alongside text using a joint encoder, without an ASR intermediary. At query time, text queries can retrieve over audio embeddings directly.

In practice, native audio retrieval is not yet productized in major frameworks. The pragmatic stack is still Whisper + text RAG for most production systems. Where audio reasoning matters (not just retrieval), you’re better off feeding the audio directly to a native audio LLM (GPT-4o audio mode, Gemini 1.5 Pro) rather than trying to build a retrieval layer.

Practical recommendation: Use Whisper for ingestion, tag your chunks with timestamps from Whisper's word-level timestamp output and with speaker labels from a separate diarization step (Whisper itself does not identify speakers), and store the original audio segments as references. When a retrieved chunk is relevant, pass the referenced audio segment directly to GPT-4o or Gemini for the final reasoning step. You get the best of both worlds: scalable text retrieval and native audio reasoning.
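
To make the timestamp-tagging step concrete, here is a minimal sketch that groups word-level timestamps into chunks that still point back to the source audio. The input shape (dicts with "word", "start", and "end" keys) is an assumption about what your ASR step emits, and the AudioChunk class, the file name, and the 30-second limit are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AudioChunk:
    text: str
    start: float     # seconds into the source recording
    end: float
    audio_path: str  # reference back to the original file


def _to_chunk(words, audio_path):
    """Collapse a run of words into one chunk spanning first to last word."""
    text = " ".join(w["word"].strip() for w in words)
    return AudioChunk(text, words[0]["start"], words[-1]["end"], audio_path)


def chunk_words(words, audio_path, max_seconds=30.0):
    """Group word-level timestamps into chunks no longer than max_seconds."""
    chunks, current = [], []
    for w in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and w["end"] - current[0]["start"] > max_seconds:
            chunks.append(_to_chunk(current, audio_path))
            current = []
        current.append(w)
    if current:
        chunks.append(_to_chunk(current, audio_path))
    return chunks


# Hypothetical word-level output from the transcription step.
words = [
    {"word": "Revenue", "start": 0.0, "end": 0.6},
    {"word": "grew", "start": 0.6, "end": 1.0},
    {"word": "in", "start": 29.5, "end": 29.7},
    {"word": "Q3", "start": 31.0, "end": 31.4},
]
chunks = chunk_words(words, audio_path="meeting.wav")
```

Each chunk's start and end let you cut the matching audio segment at query time and hand it to an audio-capable model.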

The Full Stack Decision Matrix

| | Extract-then-Embed | CLIP / ImageBind | ColPali |
|---|---|---|---|
| Modalities | All (via conversion) | Image, text, audio (ImageBind) | PDF/doc pages natively |
| Retrieval quality on complex docs | Poor-moderate | Moderate | Best in class |
| Storage cost | Low | Low | High (~80× raw, less with compression) |
| Ingestion complexity | High (OCR, captioning) | Moderate | Low (screenshot + embed) |
| Infrastructure | Any vector DB | Qdrant, Milvus (multi-collection) | Qdrant with multi-vector, or Vespa |
| Generator requirement | Any text LLM | Multimodal LLM | Multimodal LLM (VLM) |
| Best for | Mixed corpora, fast iteration | Image-heavy corpora, cross-modal queries | Legal, medical, financial documents |
| Production maturity | High | High | Growing, not fully productized |

What the Retrieval Layer Looks Like End-to-End

A production multimodal RAG system does not use one retrieval strategy — it routes by modality. Here’s a realistic architecture:

Ingestion Router
├── .pdf → ColPali indexer → multi-vector Qdrant collection
├── .jpg/.png (standalone) → CLIP → single-vector Qdrant collection
├── .mp3/.wav → Whisper → text chunks → BGE-M3 → text Qdrant collection
├── .csv/.xlsx → schema-aware chunker → text Qdrant collection
└── .docx → Unstructured → text + embedded image extract → both paths above
Query Router (at inference time)
├── Classify query intent (text-dominant vs. visual vs. mixed)
├── Run parallel retrievals across relevant collections
├── Rerank with a cross-encoder (BGE-reranker-v2 or Cohere Rerank 3)
└── Assemble context → multimodal LLM (GPT-4o, Gemini 1.5 Pro, or Qwen2-VL)
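
The ingestion side of this routing can start as a plain extension lookup. The sketch below mirrors the ingestion tree above; the pipeline names are hypothetical labels for your own indexers, not identifiers from any library.

```python
from pathlib import Path

# Extension -> pipeline name, mirroring the ingestion tree.
# The pipeline names are hypothetical labels, not library identifiers.
ROUTES = {
    ".pdf": "colpali_multivector",
    ".jpg": "clip_image",
    ".png": "clip_image",
    ".mp3": "whisper_text",
    ".wav": "whisper_text",
    ".csv": "schema_aware_text",
    ".xlsx": "schema_aware_text",
    ".docx": "unstructured_text_and_images",
}


def route_file(path):
    """Return the ingestion pipeline for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"No ingestion route for {suffix!r}: {path}") from None


def plan_ingestion(paths):
    """Group file paths by pipeline so each indexer can run as one batch."""
    plan = {}
    for p in paths:
        plan.setdefault(route_file(p), []).append(p)
    return plan


plan = plan_ingestion(["q3_report.PDF", "logo.png", "standup.wav", "chart.jpg"])
print(plan)
```

Failing loudly on unknown extensions is a deliberate choice here: silently skipping files is how whole document types go missing from an index.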

The reranking step is non-negotiable in production. Parallel retrieval across multiple collections generates high recall but noisy results. A cross-encoder that sees the full query + retrieved chunk together — not just embedding similarity — dramatically improves precision before the generation step.

Libraries, What They’re Actually Good For, and What They Don’t Tell You

Unstructured.io — Best open-source document partitioner available. Handles PDFs, DOCX, PPTX, HTML, emails, images. Correctly identifies element types (NarrativeText, Table, Image, Title) and preserves document hierarchy. The partition_pdf() call with strategy="hi_res" runs a layout detection model (detectron2 or YOLOX, depending on the version) before chunking, which is significantly better than PyMuPDF for complex layouts. Free tier available; enterprise API for high volume.

LlamaIndex — The right framework for document-heavy multimodal RAG. MultiModalVectorStoreIndex handles dual-index (text + image) management. Strong abstractions for MultiModalQueryEngine and MultiModalLLM. If you're building Extract-then-Embed or CLIP-based pipelines, LlamaIndex is your orchestration layer.

LangChain / LangGraph — Better for agentic workflows where the retrieval is one step in a multi-step chain. If your multimodal RAG is feeding into a tool-calling agent (retrieve → analyze → write to database → return structured output), LangGraph’s state machine approach handles that better than LlamaIndex’s pipeline abstraction. One drawback is version management: LangChain and LangGraph change quickly, some features work only with specific versions, and naming across packages is not standardized. If you depend on external tools or RPC integrations, keeping versions compatible gets very difficult.

Byaldi — The practical wrapper around ColPali. Hides the multi-vector indexing complexity and exposes a search() call that returns ranked pages (with base64-encoded page images when the collection is stored with the index). Not production-hardened for millions of documents, but fine for corpora under ~100K pages without custom indexing.

Qdrant — The vector DB with the best native multi-vector support, which is required for ColPali. Named vectors allow you to store both CLIP embeddings and ColPali patch embeddings per document in the same collection. Milvus has similar functionality. Chroma and Pinecone do not yet support multi-vector per document well.

CLIP (openai/clip-vit-base-patch32) — The workhorse for image-text retrieval. The base model runs on CPU. The large variant (ViT-L/14) is meaningfully better on domain-specific imagery but requires a GPU. Fine-tuning on your domain data makes a significant difference if you’re working with non-natural images (medical scans, schematics, floor plans).

ImageBind — More powerful than CLIP for cross-modal retrieval across 6 modalities, but heavier to run and less battle-tested in production RAG pipelines. Worth evaluating if audio + visual cross-modal queries are a core requirement.

Whisper large-v3 — Current open-source ASR ceiling. The word-level timestamp alignment is genuinely useful: it lets you trace retrieved text chunks back to exact audio segments. Use faster-whisper (CTranslate2-optimized) for up to 4× faster inference at comparable accuracy.

Qwen2-VL-7B — The open-source VLM of choice for the generation step if you’re running ColPali-based retrieval locally. Strong document understanding, handles charts and tables well, fits on a single A100 in 4-bit quantization.

What Doesn’t Work Well Yet

Video is still not a first-class modality in any RAG framework. The pragmatic approach is 1 FPS frame sampling: extract one frame per second, then embed with CLIP or send to a VLM. For a 10-minute video, that’s 600 images. At GPT-4o’s image token costs, that’s non-trivial per query. Dedicated video understanding models (VideoLLaMA2, CogVLM2-Video) exist but aren’t integrated into retrieval pipelines well yet.
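
The frame budget is easy to estimate before committing to a sampling rate. A minimal sketch, using whole-second intervals; the 5-second interval is only an illustrative alternative to 1 FPS.

```python
def frame_timestamps(duration_s, interval_s=1):
    """Whole-second timestamps of the frames to sample from a video."""
    return list(range(0, duration_s, interval_s))


one_fps = frame_timestamps(duration_s=600)               # 10-minute video
sparse = frame_timestamps(duration_s=600, interval_s=5)  # one frame per 5 s
print(len(one_fps), len(sparse))
```

Multiplying the frame count by your model's per-image cost gives a rough upper bound on what a single video adds to ingestion or to a query.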

Audio retrieval (not just transcription) is research-grade. WavRAG showed it’s possible; productized implementations don’t exist in major frameworks.

Cross-modal hallucination is a real problem. When your multimodal LLM receives a mix of text chunks and images, it can confabulate connections between them that don’t exist in the source. Reranking helps reduce irrelevant context; structured prompting (explicitly labeling each retrieved chunk by source and modality) reduces cross-modal hallucination significantly.
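
Structured prompting of this kind can be as simple as a labeled header on every retrieved item. The sketch below is one possible layout, not a standard format; the field names, the instruction wording, and the sample chunks are all hypothetical.

```python
def build_context(chunks):
    """Render retrieved chunks as numbered blocks labeled by modality and source."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        header = f"[{i}] modality={c['modality']} source={c['source']}"
        # Images are attached separately; the text block only references them.
        body = c["text"] if c["modality"] == "text" else "(attached image)"
        blocks.append(f"{header}\n{body}")
    return "\n\n".join(blocks)


INSTRUCTION = (
    "Answer using only the numbered sources below. Cite sources by number. "
    "Do not assume two sources describe the same thing unless they say so."
)

retrieved = [
    {
        "modality": "text",
        "source": "contract.pdf#p4",
        "text": "Either party may terminate with 30 days notice.",
    },
    {"modality": "image", "source": "pricing_chart.png", "text": ""},
]
prompt = INSTRUCTION + "\n\n" + build_context(retrieved)
print(prompt)
```

Numbering the sources also makes the answer auditable: a citation to source 2 can be checked against the exact image that was attached.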

Evaluating multimodal RAG is harder than text RAG. Standard metrics (BLEU, ROUGE, faithfulness scores) don’t apply cleanly to answers grounded in images or audio. ViDoRe (Visual Document Retrieval Benchmark) is the current standard for the retrieval component. For end-to-end evaluation, MRAMG-Bench (2025) covers multimodal generation quality, but running it in your own pipeline requires custom tooling.
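
For the retrieval component, a ranking metric such as nDCG@k is straightforward to compute on your own labeled queries (the ViDoRe leaderboard reports nDCG@5). A minimal sketch with binary relevance; the page IDs are made up.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevant, k=5):
    """nDCG@k with binary relevance: `relevant` is the set of correct page IDs."""
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(relevant), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# The one relevant page was retrieved at rank 2 instead of rank 1.
score = ndcg_at_k(["p7", "p2", "p9"], relevant={"p2"}, k=5)
print(round(score, 3))
```

Averaging this score over a few hundred labeled queries from your own corpus is usually more informative than a public benchmark number.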

The Decision You’ll Actually Have to Make

The choice isn’t between these three patterns in isolation — it’s about which combination covers your data distribution. In my experience building these pipelines, the practical split is:

  • Text-dominant corpus with some embedded media → LlamaIndex + Unstructured + Extract-then-Embed + GPT-4o for generation. Lowest operational overhead.
  • Mixed media with lots of visually rich PDFs → ColPali via Byaldi for document retrieval + CLIP for standalone image retrieval + Whisper for audio + Qdrant as the multi-collection store. More infrastructure, significantly better on complex documents.
  • Agentic pipeline where retrieval is one tool among many → LangGraph orchestration wrapping any of the above retrieval implementations.

The multimodal retrieval problem is not solved. ColPali is the most interesting architectural development in this space right now, and it’s still maturing. The hybrid approach — routing by modality, using the right retriever for each, combining at the generation step with a capable VLM — is currently the most defensible production architecture.

Multimodal RAG: Architecture, Tradeoffs, and What Actually Works in Production was originally published in Towards AI on Medium.
