We tried vectors, ASTs, and brute-force context stuffing for code retrieval. Graphs with LLM-generated semantics worked best. Here’s what we learned.

We spent the last year building a code indexing system and tried pretty much every retrieval approach along the way. Sharing what actually worked and what didn't because the discourse around "just use embeddings" or "just use Tree-sitter" is missing some nuance.

The core problem
AI coding tools re-read your entire repo every session. Context windows are getting bigger, but that just means you're spending more tokens saying the same thing over and over. We wanted persistent, structured memory for codebases.

What we tried and what broke:

Vector embeddings on code chunks. Sounds obvious, doesn't work well in practice. A function called `process()` in a payments service and `process()` in an image pipeline embed to similar vectors because the tokens are similar. But they have nothing to do with each other. Vectors flatten call graphs, inheritance, imports — all the structural relationships that make code make sense. Retrieval precision was bad enough that we abandoned this entirely.

Tree-sitter AST parsing. Precise and fast. But it only gives you structure, not meaning. It can tell you a function exists and what it calls. It can't tell you "this function handles webhook retries for failed Stripe payments." For answering developer questions that are phrased in business language, pure AST falls short.

What actually worked: per-file LLM analysis that generates a purpose, summary, and business context for every file, stored as nodes in a Neo4j graph with edges to classes, functions, keywords, and imports. Then fulltext search across those semantic fields instead of vector similarity.
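To make that concrete, here's a minimal sketch of the shape using the Neo4j JavaScript driver. The labels, property names, and index name below are illustrative assumptions, not the exact schema in the repo; edges to classes, keywords, and imports follow the same pattern as the function edges shown here.

```typescript
// Illustrative sketch only — labels, properties, and the index name are assumptions,
// not the exact schema used by bytebell-oss.
import neo4j from "neo4j-driver";

const driver = neo4j.driver(
  "bolt://127.0.0.1:7687",
  neo4j.auth.basic("neo4j", "password")
);

// One fulltext index over the LLM-generated semantic fields (created once):
// CREATE FULLTEXT INDEX fileSemantics FOR (f:File)
//   ON EACH [f.purpose, f.summary, f.businessContext]

// Store one file's LLM-generated analysis as a node, with edges to its functions.
async function upsertFileNode(a: {
  path: string;
  purpose: string;
  summary: string;
  businessContext: string;
  functions: string[];
}) {
  const session = driver.session();
  try {
    await session.run(
      `MERGE (f:File {path: $path})
       SET f.purpose = $purpose,
           f.summary = $summary,
           f.businessContext = $businessContext
       WITH f
       UNWIND $functions AS fn
       MERGE (g:Function {name: fn})
       MERGE (f)-[:DEFINES]->(g)`,
      a
    );
  } finally {
    await session.close();
  }
}

// Retrieval hits the fulltext index over the semantic fields — no vectors involved.
async function searchFiles(query: string) {
  const session = driver.session();
  try {
    const res = await session.run(
      `CALL db.index.fulltext.queryNodes("fileSemantics", $q)
       YIELD node, score
       RETURN node.path AS path, score
       ORDER BY score DESC LIMIT 10`,
      { q: query }
    );
    return res.records.map((r) => ({ path: r.get("path"), score: r.get("score") }));
  } finally {
    await session.close();
  }
}
```

The point of the sketch: the "semantic" part lives in plain text properties on graph nodes, so retrieval is a fulltext query plus graph traversal rather than nearest-neighbour lookup.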

This tracks what recent papers are finding. RepoGraph (ICLR 2025) showed +32.8% on SWE-bench with graph approaches. Code-Craft showed +82% top-1 retrieval precision with bottom-up LLM summaries from code graphs.

The tradeoff is indexing cost. You're making an LLM call per file. We use SHA-256 diffing so reindexing only hits changed files, which makes it manageable, but the upfront cost is real.
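Roughly, the hash gate looks like the sketch below. The store shape and function names are made up for illustration; the idea is just "hash the bytes, skip the per-file LLM call when the hash hasn't moved."

```typescript
// Sketch of diff-aware reindexing: only files whose SHA-256 changed get a fresh LLM pass.
// The HashStore shape and function names are illustrative, not the repo's actual internals.
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

type HashStore = Map<string, string>; // path -> SHA-256 recorded at last index

async function filesNeedingReindex(paths: string[], store: HashStore): Promise<string[]> {
  const stale: string[] = [];
  for (const path of paths) {
    const digest = createHash("sha256").update(await readFile(path)).digest("hex");
    if (store.get(path) !== digest) {
      stale.push(path);        // changed (or never indexed): re-run the per-file LLM analysis
      store.set(path, digest); // record the new hash for the next run
    }
  }
  return stale; // LLM cost ends up proportional to churn, not repo size
}
```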

We open sourced the system if anyone wants to look at the implementation or poke holes in it: github.com/ByteBell/bytebell-oss

Genuinely curious what others are seeing. Anyone running GitNexus, Sourcegraph, or Augment on large repos? How does retrieval quality feel compared to brute-force context stuffing? Is anyone getting good results with embeddings on code that I'm missing?

What's actually different from the obvious neighbours

I know "code knowledge graph" is a crowded space now, so here's the honest side-by-side from our comparison.md:

| | Bytebell | PageIndex | GitNexus | GraphRAG | Sourcegraph/Cody | Augment |
|---|---|---|---|---|---|---|
| Domain | Code | Long PDFs/docs | Code | General text | Code | Code |
| Deployment | Local Bun daemon, BYO infra | Python lib + cloud | Browser/WASM or CLI | Python lib | Self-hosted or SaaS | SaaS |
| Indexing | Per-file LLM → purpose + summary + businessContext + entities | TOC reasoning tree | Tree-sitter AST + community detection | Per-chunk LLM entities + community clustering | LSIF/SCIP search index | Proprietary semantic index |
| Storage | Neo4j + MongoDB | TOC tree, no vector DB | LadybugDB (ex-Kuzu) | Parquet/GraphML | Proprietary | Proprietary cloud |
| Per-node semantics | purpose / summary / businessContext | None (structural) | Optional per-symbol | Community text summaries | None (search-based) | Embeddings |
| Diff-aware reindex | Per-file SHA-256, LLM cost ∝ churn | Full re-parse | Git-diff hook | Full re-extract | Incremental | Continuous (managed) |
| Outbound calls | OpenRouter only, binds 127.0.0.1 | Cloud agent option | None | LLM provider only | Code snippets → cloud LLM | Source code → SaaS |
| Vectors? | No — graph + fulltext only | No | Optional | Yes (community) | No | Yes |

The short version of the differentiation:

  • vs PageIndex — PageIndex is brilliant for PDFs and filings, not code. No code graph, no MCP server, no class/function/import edges.
  • vs GitNexus — closest neighbour. Both are MCP-native code graphs. GitNexus uses Tree-sitter ASTs (precise but language-bound and structure-only); Bytebell uses LLM-generated per-file narrative semantics (purpose / summary / businessContext) so retrieval matches what a developer means, not just what the code spells. We index the vocabulary gap; they index the syntax.
  • vs Microsoft GraphRAG — GraphRAG is for narrative text. Per-chunk entity extraction + community clustering doesn't map cleanly onto code where a "community" is a service or a module. We borrowed the graph idea and rebuilt the schema for code.
  • vs Sourcegraph — Sourcegraph is a search index with LSIF/SCIP. Powerful, but no per-node semantic enrichment, no MCP-native retrieval surface, and the cloud version sends snippets out. Bytebell is single-tenant, local, with structured semantics on every node.
  • vs Augment — Augment is a great SaaS context engine, but your source code goes to their servers. Bytebell binds to 127.0.0.1 and the only outbound call is to OpenRouter for the per-file analysis (which you can route to a local model if you want).

The retrieval shape

Three MCP tools — that's it:

  1. smart_search(query) — fused six-channel search across file purpose/summary, businessContext, paths, keyword names, class/function signatures, module imports. Returns ranked top-K with folder clustering.
  2. keyword_lookup(term) — reverse: term → all matching named entities (keywords, classes, functions, modules) → files.
  3. retrieve_file — three modes: metadata (purpose + class/function line ranges), content (specific line ranges or in-file search with surrounding context), and bulk_search (a parallel scan across up to 50 files).

Most well-formed code questions resolve in 2–4 tool calls. No re-clone, no full-file dumps, no embeddings round-trip.
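To make the "2–4 tool calls" concrete, here's roughly what one question ("where do we retry failed Stripe webhooks?") might look like at the MCP layer. These are illustrative `tools/call` request bodies; the argument names, mode values, and file path are assumptions based on the tool descriptions above, and in practice the client (e.g. Claude Code after `claude mcp add`) drives the transport, handshake, and these calls for you.

```typescript
// Illustrative MCP tools/call payloads for one developer question.
// A real client sends these over the MCP Streamable HTTP transport after the
// initialize handshake; argument shapes here are assumptions, not published schemas.
const session = [
  {
    jsonrpc: "2.0", id: 1, method: "tools/call",
    params: {
      name: "smart_search", // fused six-channel search over the semantic fields
      arguments: { query: "retry failed Stripe webhook payments" },
    },
  },
  {
    jsonrpc: "2.0", id: 2, method: "tools/call",
    params: {
      name: "retrieve_file", // hypothetical path from the search hits
      arguments: { path: "services/payments/webhookRetry.ts", mode: "metadata" },
    },
  },
  {
    jsonrpc: "2.0", id: 3, method: "tools/call",
    params: {
      name: "retrieve_file", // pull only the relevant line range, not the whole file
      arguments: { path: "services/payments/webhookRetry.ts", mode: "content", startLine: 40, endLine: 95 },
    },
  },
];
```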

Quickstart

```bash
# prereqs: Bun ≥1.1, Docker, OpenRouter key
bytebell set openrouter-api-key sk-or-...
bytebell set openrouter-model anthropic/claude-sonnet-4.6
bytebell boot   # spins up Mongo+Neo4j+Redis locally
bytebell index https://github.com/anthropics/claude-code

# Claude Code
claude mcp add --transport http bytebell http://127.0.0.1:8080/mcp
```

Research grounding (for the curious)

The shape — graph at ingest + LLM-derived per-node semantics + MCP retrieval — tracks recent work showing that purely structural (AST/call-graph) retrieval and purely semantic (embeddings) retrieval each leave a lot on the table:

  • RepoGraph (ICLR 2025) — +32.8% on SWE-bench
  • CodexGraph (NAACL 2025) — graph-DB queries beat similarity-only retrieval
  • Code-Craft (arXiv:2504.08975) — bottom-up LLM summaries from a code graph, +82% top-1 retrieval precision
  • Hierarchical Repo-Level Code Summarization (ICSE LLM4Code 2025) — closest motivational match for our purpose/summary/businessContext schema

Full reading list and links in the README.

What this is NOT

  • Not a hosted product. Not a chat UI. Not multi-tenant. There is exactly one tenant (orgId="local") and the server binds to 127.0.0.1.
  • Not a Sourcegraph replacement at enterprise scale today — no cross-file call edges yet (deliberate tradeoff for language-agnostic ingest; future strategies will add them).
  • Not free for commercial use. AGPL-3.0 + non-commercial clause. If you want to run this in a for-profit product, that's a separate conversation (team@bytebell.ai).

Repo: https://github.com/ByteBell/bytebell-oss

Honest feedback, issues, and "you're wrong about X" are all welcome — especially from anyone using GitNexus, Sourcegraph, or Augment in anger. We've tried to be fair in the comparison table but happy to be corrected.
