5 Reranking Techniques in RAG: From Fast Retrieval to Accurate Context

You have built a RAG pipeline. You chunked your documents, picked an embedding model, and wired up a vector database. You ask a question, the retriever pulls back 50 chunks, and you stuff the top 5 into your LLM prompt. The answer is… okay. Not great. Sometimes it misses the mark entirely.

You check the retrieved chunks. The one that actually contains the answer? It is sitting at position 23. Your embedding model thought it was moderately relevant, but not top-5 relevant. So your LLM never sees it. This is not a rare edge case. It is the single most common failure mode in production RAG systems.

The problem is not your retriever. It is what happens after retrieval.

The Retrieval Bottleneck Nobody Talks About

Initial retrieval in RAG is optimized for one thing: speed. Whether you are using dense vector search, sparse BM25, or a hybrid of both, the goal is to narrow a million documents down to a manageable set in milliseconds. That speed comes at a cost. The retrieval model makes a fast, shallow judgment about relevance. It looks at the query and each document in isolation, scores them, and returns the best matches.

But “best matches” and “most useful for answering this question” are not the same thing.

Here is why initial retrieval fails:

  • Shallow scoring — Embedding models judge similarity, not usefulness. A chunk can be semantically close to the query while containing zero useful information.
  • No query-document interaction — Bi-encoders encode the query and documents separately. They never see the query and document together, so they miss nuanced relationships.
  • Context blindness — The retriever does not know what the LLM needs. It cannot judge whether a chunk will help produce a good answer.
  • False positives — Generic chunks about a topic often score higher than specific, factual chunks because they are “safer” matches.

This is where reranking comes in.

What Is Reranking?

Reranking is a second-stage refinement in RAG pipelines. After fast initial retrieval returns a broad set of candidates, a more powerful model reorders those documents by their true relevance to the specific query. Think of it as quality control before context hits the LLM.

The pipeline looks like this:

  1. Initial retrieval: Fast, broad search returns 50–100 candidate chunks
  2. Reranking: A smarter model scores each candidate against the query and reorders them
  3. Final selection: The top 3–5 reranked chunks go to the LLM

This two-stage approach gives you the best of both worlds: the speed of approximate retrieval and the accuracy of deep relevance scoring. You do not replace your retriever. You augment it.
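
To make the flow concrete, here is a minimal sketch of the two-stage wiring. The retrieve, rerank, and llm callables are placeholders for whatever retriever, reranker, and generator you use; the shape of the pipeline is the point.

# Two-stage RAG sketch; retrieve, rerank, and llm are placeholders, not a specific library
def answer_with_rag(query: str, retrieve, rerank, llm, n_candidates: int = 50, n_context: int = 5) -> str:
    # Stage 1: fast, broad retrieval over the full index
    candidates = retrieve(query, k=n_candidates)      # -> list[str] of chunks

    # Stage 2: slower, more accurate reranking over the small candidate set
    reranked = rerank(query, candidates)              # -> list[(chunk, score)], best first

    # Final selection: only the top chunks reach the LLM prompt
    context = "\n\n".join(chunk for chunk, _ in reranked[:n_context])
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return llm(prompt)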

Retrieval vs. Reranking

Here is the key difference:

| Aspect              | Initial Retrieval        | Reranking                     |
| ------------------- | ------------------------ | ----------------------------- |
| Goal | Find candidates fast | Judge true relevance |
| Speed | Milliseconds | Tens to hundreds of milliseconds |
| Input | Query + index | Query + top-k candidates |
| Scoring depth | Shallow (embedding dot product) | Deep (cross-attention, token interaction) |
| Cost | Low (local compute) | Higher (model inference) |
| When to use | Every query | On top-k candidates only |

The insight is simple: do not spend expensive compute on a million documents. Spend it on the 50 most promising ones.

The Five Reranking Techniques

After building and debugging RAG systems across multiple projects, I have found that reranking techniques fall on a spectrum from “fast and simple” to “slow and powerful.” The right choice depends on your latency budget, accuracy requirements, and operational constraints.

Here are the five techniques that separate good RAG from great RAG.

1. Cross-Encoder Reranking

Cross-encoders are the gold standard for reranking accuracy. Unlike bi-encoders, which encode the query and document separately, a cross-encoder processes the query and document together as a single input. This allows the model to capture deep token-level interactions — pronoun resolution, negation handling, and implicit relationships that bi-encoder retrievers completely miss.

How it works:

  1. Your retriever returns 50 candidate chunks
  2. For each chunk, concatenate the query and chunk: `[CLS] query [SEP] chunk [SEP]`
  3. Feed this into a cross-encoder model (like `cross-encoder/ms-marco-MiniLM-L-6-v2`)
  4. The model outputs a relevance score
  5. Sort by score and take the top 5

from sentence_transformers import CrossEncoder

# Load a cross-encoder reranker
# ms-marco-MiniLM-L-6-v2 is fast and accurate for general use
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_with_cross_encoder(query: str, retrieved_docs: list[str], top_k: int = 5):
    """
    Rerank retrieved documents using a cross-encoder.

    Args:
        query: The user question
        retrieved_docs: List of document chunks from initial retrieval
        top_k: Number of documents to return after reranking

    Returns:
        List of (document, score) tuples, sorted by relevance
    """
    # Create query-document pairs
    pairs = [[query, doc] for doc in retrieved_docs]

    # Get relevance scores
    scores = cross_encoder.predict(pairs)

    # Combine docs with scores and sort
    scored_docs = list(zip(retrieved_docs, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return scored_docs[:top_k]

# Example usage
query = "What are the side effects of amoxicillin?"
retrieved = [
    "Amoxicillin is a penicillin antibiotic used to treat bacterial infections.",
    "Common side effects include nausea, vomiting, and diarrhea.",
    "The drug was first discovered in 1958 by researchers at Beecham.",
    "Patients with penicillin allergies should avoid amoxicillin.",
    "Side effects may include rash, itching, and in rare cases, anaphylaxis.",
]

top_docs = rerank_with_cross_encoder(query, retrieved, top_k=3)
for doc, score in top_docs:
    print(f"Score: {score:.4f} | {doc}")

The verdict: Use cross-encoders when accuracy is your top priority and you can tolerate 50–200ms of additional latency. For most production RAG systems, this is the best starting point. The main limitation is that you process each query-document pair separately, so latency scales linearly with the number of candidates.
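
If that linear scaling becomes a problem, two practical knobs help: batch the pair scoring and cap the candidate count. A small sketch, assuming the function defined above (batch_size is a standard argument of CrossEncoder.predict; the right value depends on your hardware):

# Inside rerank_with_cross_encoder: score pairs in larger batches to use the GPU/CPU better
scores = cross_encoder.predict(pairs, batch_size=64)

# At call time: cap the candidate count to bound worst-case latency
top_docs = rerank_with_cross_encoder(query, retrieved[:30], top_k=5)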

2. Reciprocal Rank Fusion (RRF)

What if you do not want to run a neural model at all? Reciprocal Rank Fusion (RRF) is a brilliantly simple technique that merges rankings from multiple retrievers using a deterministic formula. No training. No inference. Just math.

RRF shines when you are already running hybrid retrieval — say, dense vector search plus BM25 keyword search. Each retriever returns its own ranking. RRF combines them into a single, better ranking.

The formula is simple:

RRF_score(d) = sum over retrievers of 1 / (k + rank_r(d))

Where `k` is a constant (typically 60) and `rank_r(d)` is the rank of document `d` in retriever `r`. Documents that rank well across multiple retrievers get boosted to the top. For example, with k = 60, a document ranked 2nd by BM25 and 4th by vector search scores 1/62 + 1/64 ≈ 0.0318, edging out a document ranked 1st by only a single retriever (1/61 ≈ 0.0164).

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """
    Merge multiple document rankings using Reciprocal Rank Fusion.

    Args:
        rankings: List of rankings, where each ranking is a list of document IDs
            ordered from most to least relevant
        k: RRF constant (default 60, as recommended in the original paper)

    Returns:
        List of (document_id, rrf_score) tuples, sorted by fused score
    """
    scores = {}

    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id not in scores:
                scores[doc_id] = 0.0
            # RRF formula: 1 / (k + rank)
            scores[doc_id] += 1.0 / (k + rank)

    # Sort by score descending
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Example: merging BM25 and vector search results
bm25_results = ["doc_5", "doc_2", "doc_8", "doc_1", "doc_9"]
vector_results = ["doc_1", "doc_5", "doc_3", "doc_8", "doc_7"]

fused = reciprocal_rank_fusion([bm25_results, vector_results])

print("Fused ranking:")
for doc_id, score in fused:
    print(f"  {doc_id}: {score:.4f}")

# Notice: doc_5 and doc_1 appear in both retrievers and get boosted to the top

The verdict: RRF is the fastest technique on this list — it adds essentially zero latency. Use it when you are already running hybrid retrieval and want a quick win without adding model inference to your pipeline. It will not beat a cross-encoder on pure accuracy, but it is surprisingly effective for a zero-ML approach.

3. Cohere Rerank API

Not every team wants to host and optimize a reranking model. If you would rather focus on your product than on model ops, managed reranking APIs are a compelling option. Cohere’s Rerank API is the most mature offering in this space.

The API takes a query and a list of documents, and returns them ordered by relevance. Under the hood, Cohere runs a state-of-the-art reranking model with optimizations you would spend weeks building yourself. It handles multilingual content, long documents, and scales instantly.

import cohere
from dotenv import load_dotenv
import os

load_dotenv()

# Initialize Cohere client
co = cohere.Client(os.getenv("COHERE_API_KEY"))

def rerank_with_cohere(query: str, documents: list[str], top_k: int = 5):
    """
    Rerank documents using Cohere's managed Rerank API.

    Args:
        query: The user question
        documents: List of document chunks from initial retrieval
        top_k: Number of documents to return

    Returns:
        List of (document, relevance_score) tuples
    """
    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=documents,
        top_n=top_k,
        return_documents=True
    )

    results = []
    for result in response.results:
        results.append((
            result.document.text,
            result.relevance_score
        ))

    return results

# Example usage
query = "How do I handle authentication in a FastAPI app?"
docs = [
    "FastAPI is a modern web framework for building APIs with Python.",
    "To add authentication, use OAuth2PasswordBearer and JWT tokens.",
    "Pydantic models in FastAPI provide automatic request validation.",
    "The OAuth2PasswordBearer class expects a token URL endpoint.",
    "FastAPI was created by Sebastián Ramírez and released in 2018.",
]

ranked = rerank_with_cohere(query, docs, top_k=3)
for doc, score in ranked:
    print(f"Score: {score:.4f} | {doc}")

The verdict: Use Cohere Rerank when you want state-of-the-art accuracy without operational overhead. The trade-off is cost per API call and external dependency. For high-traffic applications, the latency of an external API call may also be a concern. But for most teams, the time saved on model hosting and optimization more than justifies the cost.

4. ColBERT

Cross-encoders are accurate but slow. Bi-encoders are fast but shallow. ColBERT strikes a balance between the two using a technique called “late interaction.”

Here is how it works:

  1. Offline: Each document is encoded token-by-token into contextual embeddings. These are precomputed and stored in an index.
  2. Online: At query time, the query is also encoded token-by-token.
  3. Matching: For each query token, find the most similar document token using MaxSim (maximum similarity). Sum these max similarities across all query tokens to get the final relevance score.

This is much faster than cross-encoders because document embeddings are precomputed. It is more accurate than bi-encoders because it captures token-level interactions.
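
The arithmetic behind that claim is tiny. Here is a toy sketch of the MaxSim scoring step over precomputed token embeddings (NumPy only; the shapes and L2-normalization are assumptions for illustration, not ColBERT's actual implementation):

import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """
    Late-interaction (MaxSim) relevance score.

    query_embs: (num_query_tokens, dim) query token embeddings
    doc_embs: (num_doc_tokens, dim) precomputed document token embeddings
    Both are assumed L2-normalized, so a dot product is cosine similarity.
    """
    # Similarity of every query token against every document token
    sim = query_embs @ doc_embs.T          # (num_query_tokens, num_doc_tokens)

    # For each query token, keep only its best-matching document token...
    max_per_query_token = sim.max(axis=1)  # (num_query_tokens,)

    # ...and sum those maxima into one relevance score
    return float(max_per_query_token.sum())

# Reranking candidates is then just scoring each one's precomputed embeddings:
# scores = [maxsim_score(query_embs, doc_index[doc_id]) for doc_id in candidate_ids]

In practice you would not hand-roll this. The ColBERT library manages the token-level index and the search; a typical setup looks like this: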

from colbert import Searcher
from colbert.infra import Run, RunConfig

def setup_colbert_searcher(index_path: str, checkpoint: str):
    """
    Initialize a ColBERT searcher for late-interaction reranking.

    Args:
        index_path: Path to the pre-built ColBERT index
        checkpoint: Path to the ColBERT model checkpoint

    Returns:
        Configured Searcher instance
    """
    with Run().context(RunConfig(nranks=1, experiment="reranking")):
        searcher = Searcher(
            index=index_path,
            checkpoint=checkpoint
        )
    return searcher

def rerank_with_colbert(searcher, query: str, doc_ids: list[str], top_k: int = 5):
    """
    Rerank documents using ColBERT's late interaction.

    Args:
        searcher: Initialized ColBERT Searcher
        query: The user question
        doc_ids: List of document IDs from initial retrieval
        top_k: Number of documents to return

    Returns:
        List of (doc_id, score) tuples
    """
    # Search within the candidate set
    results = searcher.search(
        query,
        k=top_k,
        filter_fn=lambda pid: pid in doc_ids  # Only rerank candidates
    )

    return list(zip(results[0], results[2]))  # doc_ids, scores

# Note: ColBERT requires a pre-built index and model checkpoint.
# For production use, build the index once and load it at startup.

The verdict: ColBERT is ideal for large-scale applications where you need better accuracy than bi-encoders but lower latency than cross-encoders. The main cost is the complexity of building and maintaining a ColBERT index. If you are already comfortable with vector databases and want to push retrieval quality without the linear latency cost of cross-encoders, ColBERT is the sweet spot.

5. LLM-as-a-Judge

For the highest-stakes domains — medical diagnosis, legal research, financial compliance — even cross-encoders may not be enough. The queries are complex, the context is nuanced, and a wrong answer is expensive. In these cases, the most accurate reranker is the LLM itself.

LLM-as-a-Judge works by prompting a large language model (GPT-4, Claude, or Gemini) to score each candidate document’s relevance to the query. Because the LLM has deep world knowledge and strong reasoning capabilities, it can judge relevance in ways that smaller reranking models cannot.

The typical prompt looks like this:

You are evaluating documents for a retrieval system.

Query: {query}
Document: {document}

Rate how relevant this document is for answering the query.
Respond with a single integer from 1 to 10, where 10 means perfectly relevant.

Relevance score:

You run this prompt for each candidate document, parse the scores, and return the top-k.

from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def score_document_with_llm(query: str, document: str) -> int:
    """
    Ask an LLM to score a document's relevance to a query.

    Args:
        query: The user question
        document: A candidate document chunk

    Returns:
        Integer relevance score from 1-10
    """
    prompt = f"""You are evaluating documents for a retrieval system.

Query: {query}
Document: {document}

Rate how relevant this document is for answering the query.
Respond with a single integer from 1 to 10, where 10 means perfectly relevant.
Be strict: only give high scores to documents that directly help answer the query.

Relevance score:"""

    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5
    )

    try:
        score = int(response.choices[0].message.content.strip())
        return max(1, min(10, score))  # Clamp to 1-10
    except ValueError:
        return 5  # Default on parse failure

def rerank_with_llm_judge(query: str, documents: list[str], top_k: int = 3):
    """
    Rerank documents using an LLM as a relevance judge.

    Args:
        query: The user question
        documents: List of candidate document chunks
        top_k: Number of documents to return

    Returns:
        List of (document, score) tuples, sorted by relevance
    """
    scored = []
    for doc in documents:
        score = score_document_with_llm(query, doc)
        scored.append((doc, score))

    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]

# Example usage
query = "What are the tax implications of RSU vesting for employees in California?"
docs = [
    "RSUs are restricted stock units granted to employees as part of compensation.",
    "In California, RSU income is taxed as ordinary income at vesting, not at grant.",
    "Employers typically withhold federal and state taxes at vesting time.",
    "Stock options and RSUs have different tax treatments under IRS rules.",
    "California has one of the highest state income tax rates in the US.",
]

ranked = rerank_with_llm_judge(query, docs, top_k=3)
for doc, score in ranked:
    print(f"Score: {score}/10 | {doc}")

The verdict: LLM-as-a-Judge is the most accurate technique on this list and also the slowest and most expensive. Use it when you are dealing with complex, high-value queries where a wrong answer costs real money or risk. For most applications, cross-encoders or managed APIs provide 90% of the accuracy at 10% of the cost. Reserve LLM judges for the cases that truly matter.

Which One Should You Use?

There is no single best reranker. The right choice depends on your constraints. Here is a decision framework:

| Technique             | Best For                                          | Latency      | Cost           |
| --------------------- | ------------------------------------------------- | ------------ | -------------- |
| Cross-Encoder | Maximum quality on top-k candidates | 50-200ms | Local GPU/CPU |
| RRF | Hybrid retrieval without adding model inference | ~0ms | Free |
| Cohere Rerank API | Strong accuracy without operational overhead | 100-300ms | Per API call |
| ColBERT | Large-scale, low-latency use cases | 20-100ms | Index + GPU |
| LLM-as-a-Judge | Complex, high-value queries (medical, legal) | 1-5 seconds | Per API call |

My recommendation: Start with a cross-encoder. It gives you the biggest accuracy improvement for the least complexity. If latency becomes an issue, evaluate ColBERT or a managed API like Cohere. Add RRF if you are running hybrid retrieval. Reserve LLM-as-a-Judge for the specific queries where you need maximum accuracy and can afford the cost.

Most teams never need to choose just one. A common production pattern is:

  1. Hybrid retrieval (dense + BM25) with RRF fusion
  2. Cross-encoder reranking on the top 50 candidates
  3. LLM-as-a-Judge fallback for flagged high-stakes queries

This layered approach gives you speed for the common case and accuracy when it matters.
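
Pulling the earlier snippets together, a sketch of that layered pipeline could look like the following. The bm25_search, vector_search, and is_high_stakes functions and the doc_store lookup are placeholders you would wire to your own stack:

def layered_rerank(query: str, doc_store: dict[str, str], top_k: int = 5):
    # Stage 1: hybrid retrieval fused with RRF (doc IDs only, near-zero extra latency)
    bm25_ids = bm25_search(query, k=50)        # placeholder: your keyword retriever
    vector_ids = vector_search(query, k=50)    # placeholder: your dense retriever
    fused = reciprocal_rank_fusion([bm25_ids, vector_ids])
    candidate_texts = [doc_store[doc_id] for doc_id, _ in fused[:50]]

    # Stage 2: cross-encoder reranking on the fused top 50
    reranked = rerank_with_cross_encoder(query, candidate_texts, top_k=top_k)

    # Stage 3: LLM judge only for queries flagged as high stakes
    if is_high_stakes(query):                  # placeholder: your routing logic
        reranked = rerank_with_llm_judge(query, [doc for doc, _ in reranked], top_k=top_k)

    return reranked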

Final Thoughts

Reranking is the most underappreciated component of production RAG. Teams spend months optimizing chunk size and embedding models while ignoring the fact that their best context is buried at position 23. A good reranker fixes this with minimal engineering effort.

The techniques in this article span the full spectrum from free and fast (RRF) to expensive and powerful (LLM-as-a-Judge). The key insight is that reranking is not an all-or-nothing decision. You can layer techniques, start simple, and add complexity only when your metrics demand it.

Here are the key takeaways from this article:

  • Initial retrieval is fast but shallow — it finds candidates, not answers
  • Reranking is quality control — a second pass that judges true relevance
  • Cross-encoders are the best starting point — high accuracy, moderate complexity
  • RRF gives you free wins — if you run hybrid retrieval, fuse the rankings
  • Managed APIs save engineering time — Cohere Rerank is production-ready out of the box
  • ColBERT balances speed and accuracy — ideal for large-scale systems
  • LLM-as-a-Judge is your nuclear option — use it when accuracy matters more than cost

If you are running RAG in production and you are not reranking, you are leaving answer quality on the table. Pick one technique from this article, implement it this week, and measure the difference.

Thank you for reading this article! I hope you found it helpful. If you have any questions or feedback, please feel free to reach out to me.

#RAG #RetrievalAugmentedGeneration #LLM #MachineLearning #VectorSearch #AIEngineering #MLOps #NLP

