We tried vectors, ASTs, and brute-force context stuffing for code retrieval. Graphs with LLM-generated semantics worked best. Here’s what we learned.

We spent the last year building a code indexing system and tried pretty much every retrieval approach along the way. Sharing what actually worked and what didn't because the discourse around "just use embeddings" or "just use Tree-sitter" is missing some nuance.

The core problem
AI coding tools re-read your entire repo every session. Context windows are getting bigger, but that just means you're spending more tokens saying the same thing over and over. We wanted persistent, structured memory for codebases.

What we tried and what broke:

Vector embeddings on code chunks. Sounds obvious, doesn't work well in practice. A function called `process()` in a payments service and `process()` in an image pipeline embed to similar vectors because the tokens are similar. But they have nothing to do with each other. Vectors flatten call graphs, inheritance, imports — all the structural relationships that make code make sense. Retrieval precision was bad enough that we abandoned this entirely.

Tree-sitter AST parsing. Precise and fast. But it only gives you structure, not meaning. It can tell you a function exists and what it calls. It can't tell you "this function handles webhook retries for failed Stripe payments." For answering developer questions that are phrased in business language, pure AST falls short.

What actually worked: per-file LLM analysis that generates a purpose, summary, and business context for every file, stored as nodes in a Neo4j graph with edges to classes, functions, keywords, and imports. Then fulltext search across those semantic fields instead of vector similarity.
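To make that concrete, here's a minimal sketch of the shape using the Neo4j JavaScript driver. The labels, property names, and index name below are illustrative assumptions, not the exact schema in the repo; edges to classes, keywords, and imports follow the same pattern as the function edges shown here.

```typescript
// Illustrative sketch only — labels, properties, and the index name are assumptions,
// not the exact schema used by bytebell-oss.
import neo4j from "neo4j-driver";

const driver = neo4j.driver(
  "bolt://127.0.0.1:7687",
  neo4j.auth.basic("neo4j", "password")
);

// One fulltext index over the LLM-generated semantic fields (created once):
// CREATE FULLTEXT INDEX fileSemantics FOR (f:File)
//   ON EACH [f.purpose, f.summary, f.businessContext]

// Store one file's LLM-generated analysis as a node, with edges to its functions.
async function upsertFileNode(a: {
  path: string;
  purpose: string;
  summary: string;
  businessContext: string;
  functions: string[];
}) {
  const session = driver.session();
  try {
    await session.run(
      `MERGE (f:File {path: $path})
       SET f.purpose = $purpose,
           f.summary = $summary,
           f.businessContext = $businessContext
       WITH f
       UNWIND $functions AS fn
       MERGE (g:Function {name: fn})
       MERGE (f)-[:DEFINES]->(g)`,
      a
    );
  } finally {
    await session.close();
  }
}

// Retrieval hits the fulltext index over the semantic fields — no vectors involved.
async function searchFiles(query: string) {
  const session = driver.session();
  try {
    const res = await session.run(
      `CALL db.index.fulltext.queryNodes("fileSemantics", $q)
       YIELD node, score
       RETURN node.path AS path, score
       ORDER BY score DESC LIMIT 10`,
      { q: query }
    );
    return res.records.map((r) => ({ path: r.get("path"), score: r.get("score") }));
  } finally {
    await session.close();
  }
}
```

The point of the sketch: the "semantic" part lives in plain text properties on graph nodes, so retrieval is a fulltext query plus graph traversal rather than nearest-neighbour lookup.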

This tracks what recent papers are finding. RepoGraph (ICLR 2025) showed +32.8% on SWE-bench with graph approaches. Code-Craft showed +82% top-1 retrieval precision with bottom-up LLM summaries from code graphs.

The tradeoff is indexing cost. You're making an LLM call per file. We use SHA-256 diffing so reindexing only hits changed files, which makes it manageable, but the upfront cost is real.
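Roughly, the hash gate looks like the sketch below. The store shape and function names are made up for illustration; the idea is just "hash the bytes, skip the per-file LLM call when the hash hasn't moved."

```typescript
// Sketch of diff-aware reindexing: only files whose SHA-256 changed get a fresh LLM pass.
// The HashStore shape and function names are illustrative, not the repo's actual internals.
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

type HashStore = Map<string, string>; // path -> SHA-256 recorded at last index

async function filesNeedingReindex(paths: string[], store: HashStore): Promise<string[]> {
  const stale: string[] = [];
  for (const path of paths) {
    const digest = createHash("sha256").update(await readFile(path)).digest("hex");
    if (store.get(path) !== digest) {
      stale.push(path);        // changed (or never indexed): re-run the per-file LLM analysis
      store.set(path, digest); // record the new hash for the next run
    }
  }
  return stale; // LLM cost ends up proportional to churn, not repo size
}
```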

We open sourced the system if anyone wants to look at the implementation or poke holes in it: github.com/ByteBell/bytebell-oss

Genuinely curious what others are seeing. Anyone running GitNexus, Sourcegraph, or Augment on large repos? How does retrieval quality feel compared to brute-force context stuffing? Is anyone getting good results with embeddings on code that I'm missing?

What's actually different from the obvious neighbours

I know "code knowledge graph" is a crowded space now, so here's the honest side-by-side from our comparison.md:

| | Bytebell | PageIndex | GitNexus | GraphRAG | Sourcegraph/Cody | Augment |
|---|---|---|---|---|---|---|
| Domain | Code | Long PDFs/docs | Code | General text | Code | Code |
| Deployment | Local Bun daemon, BYO infra | Python lib + cloud | Browser/WASM or CLI | Python lib | Self-hosted or SaaS | SaaS |
| Indexing | Per-file LLM → purpose + summary + businessContext + entities | TOC reasoning tree | Tree-sitter AST + community detection | Per-chunk LLM entities + community clustering | LSIF/SCIP search index | Proprietary semantic index |
| Storage | Neo4j + MongoDB | TOC tree, no vector DB | LadybugDB (ex-Kuzu) | Parquet/GraphML | Proprietary | Proprietary cloud |
| Per-node semantics | purpose / summary / businessContext | None (structural) | Optional per-symbol | Community text summaries | None (search-based) | Embeddings |
| Diff-aware reindex | Per-file SHA-256, LLM cost ∝ churn | Full re-parse | Git-diff hook | Full re-extract | Incremental | Continuous (managed) |
| Outbound calls | OpenRouter only, binds 127.0.0.1 | Cloud agent option | None | LLM provider only | Code snippets → cloud LLM | Source code → SaaS |
| Vectors? | No — graph + fulltext only | No | Optional | Yes (community) | No | Yes |

The short version of the differentiation:

  • vs PageIndex — PageIndex is brilliant for PDFs and filings, not code. No code graph, no MCP server, no class/function/import edges.
  • vs GitNexus — closest neighbour. Both are MCP-native code graphs. GitNexus uses Tree-sitter ASTs (precise but language-bound and structure-only); Bytebell uses LLM-generated per-file narrative semantics (purpose / summary / businessContext) so retrieval matches what a developer means, not just what the code spells. We index the vocabulary gap; they index the syntax.
  • vs Microsoft GraphRAG — GraphRAG is for narrative text. Per-chunk entity extraction + community clustering doesn't map cleanly onto code where a "community" is a service or a module. We borrowed the graph idea and rebuilt the schema for code.
  • vs Sourcegraph — Sourcegraph is a search index with LSIF/SCIP. Powerful, but no per-node semantic enrichment, no MCP-native retrieval surface, and the cloud version sends snippets out. Bytebell is single-tenant, local, with structured semantics on every node.
  • vs Augment — Augment is a great SaaS context engine, but your source code goes to their servers. Bytebell binds to 127.0.0.1 and the only outbound call is to OpenRouter for the per-file analysis (which you can route to a local model if you want).

The retrieval shape

Three MCP tools — that's it:

  1. smart_search(query) — fused six-channel search across file purpose/summary, businessContext, paths, keyword names, class/function signatures, module imports. Returns ranked top-K with folder clustering.
  2. keyword_lookup(term) — reverse: term → all matching named entities (keywords, classes, functions, modules) → files.
  3. retrieve_file — three modes: metadata (purpose + class/function line ranges), content (specific line ranges or in-file search with surrounding context), and bulk_search (a parallel scan across up to 50 files).

Most well-formed code questions resolve in 2–4 tool calls. No re-clone, no full-file dumps, no embeddings round-trip.
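To make the "2–4 tool calls" concrete, here's roughly what one question ("where do we retry failed Stripe webhooks?") might look like at the MCP layer. These are illustrative `tools/call` request bodies; the argument names, mode values, and file path are assumptions based on the tool descriptions above, and in practice the client (e.g. Claude Code after `claude mcp add`) drives the transport, handshake, and these calls for you.

```typescript
// Illustrative MCP tools/call payloads for one developer question.
// A real client sends these over the MCP Streamable HTTP transport after the
// initialize handshake; argument shapes here are assumptions, not published schemas.
const session = [
  {
    jsonrpc: "2.0", id: 1, method: "tools/call",
    params: {
      name: "smart_search", // fused six-channel search over the semantic fields
      arguments: { query: "retry failed Stripe webhook payments" },
    },
  },
  {
    jsonrpc: "2.0", id: 2, method: "tools/call",
    params: {
      name: "retrieve_file", // hypothetical path from the search hits
      arguments: { path: "services/payments/webhookRetry.ts", mode: "metadata" },
    },
  },
  {
    jsonrpc: "2.0", id: 3, method: "tools/call",
    params: {
      name: "retrieve_file", // pull only the relevant line range, not the whole file
      arguments: { path: "services/payments/webhookRetry.ts", mode: "content", startLine: 40, endLine: 95 },
    },
  },
];
```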

Quickstart

```bash
# prereqs: Bun ≥1.1, Docker, OpenRouter key
bytebell set openrouter-api-key sk-or-...
bytebell set openrouter-model anthropic/claude-sonnet-4.6
bytebell boot   # spins up Mongo+Neo4j+Redis locally
bytebell index https://github.com/anthropics/claude-code

# Claude Code
claude mcp add --transport http bytebell http://127.0.0.1:8080/mcp
```

Research grounding (for the curious)

The shape — graph at ingest + LLM-derived per-node semantics + MCP retrieval — tracks recent work showing that purely structural (AST/call-graph) retrieval and purely semantic (embeddings) retrieval each leave a lot on the table:

  • RepoGraph (ICLR 2025) — +32.8% on SWE-bench
  • CodexGraph (NAACL 2025) — graph-DB queries beat similarity-only retrieval
  • Code-Craft (arXiv:2504.08975) — bottom-up LLM summaries from a code graph, +82% top-1 retrieval precision
  • Hierarchical Repo-Level Code Summarization (ICSE LLM4Code 2025) — closest motivational match for our purpose/summary/businessContext schema

Full reading list and links in the README.

What this is NOT

  • Not a hosted product. Not a chat UI. Not multi-tenant. There is exactly one tenant (orgId="local") and the server binds to 127.0.0.1.
  • Not a Sourcegraph replacement at enterprise scale today — no cross-file call edges yet (deliberate tradeoff for language-agnostic ingest; future strategies will add them).
  • Not free for commercial use. AGPL-3.0 + non-commercial clause. If you want to run this in a for-profit product, that's a separate conversation (team@bytebell.ai).

Repo: https://github.com/ByteBell/bytebell-oss

Honest feedback, issues, and "you're wrong about X" are all welcome — especially from anyone using GitNexus, Sourcegraph, or Augment in anger. We've tried to be fair in the comparison table but happy to be corrected.
