I'm working on an open-source memory infrastructure for AI agents (CtxVault). It organizes agent memory into typed, isolated vaults rather than a single shared vector store.
I've run standard retrieval benchmarks (BEIR, CoIR) against raw ChromaDB and LangChain, and confirmed that the vault abstraction adds no retrieval overhead. That part is straightforward.
The part I'm stuck on is how to benchmark the properties that actually differentiate the system. There are two main claims I want to evaluate:
First, context isolation. When multiple agents have separate memory spaces containing semantically similar content (e.g., three agents working in the same domain but for different clients), I want to measure context pollution: does information from agent A's memory leak into agent B's results? With metadata filtering on a single index, contamination is technically 0% as long as the filter is applied correctly, same as with physically separate indexes. The real difference is architectural (how many code paths can silently break the guarantee), and that doesn't translate directly into a retrieval metric. I'm looking for a way to measure this that goes beyond "contamination rate = 0 for everyone."
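One direction I've been sketching is mutation-style fault injection: instead of measuring contamination on the happy path (where everyone scores 0), deliberately disable the filter, simulating a code path that forgot to apply it, and measure the blast radius. Physically separate indexes stay at 0 even under the fault; a shared index with filtering does not. This is a toy sketch with an in-memory store and made-up helper names, not the CtxVault API:

```python
# Toy store: each record is (owner_agent_id, text). retrieve() stands in for
# the real vector search; the point here is the metric, not the ranking.
def retrieve(store, agent_id, k=5, filter_enabled=True):
    pool = [r for r in store if (not filter_enabled) or r[0] == agent_id]
    return pool[:k]

def contamination_rate(results, agent_id):
    """Fraction of retrieved records owned by a different agent."""
    if not results:
        return 0.0
    return sum(1 for owner, _ in results if owner != agent_id) / len(results)

def isolation_sensitivity(store, agent_id):
    """Contamination with the filter intact vs. with an injected fault
    (filter silently skipped). The gap is the 'blast radius' of one bug:
    large for filtered shared indexes, zero for physically separate ones."""
    healthy = contamination_rate(retrieve(store, agent_id), agent_id)
    faulty = contamination_rate(
        retrieve(store, agent_id, filter_enabled=False), agent_id
    )
    return healthy, faulty
```

Repeating this per filter-bearing code path gives a distribution of failure severities rather than a single 0%, which feels closer to measuring the architectural claim.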
Second, typed memory. CtxVault separates knowledge (semantic vaults) from skills/procedures (skill vaults), following the CoALA taxonomy. I want to measure whether this separation actually improves retrieval quality vs. dumping everything into a single index. I could measure a "type confusion rate" (how often a knowledge query returns a skill, or vice versa), but that metric favors the typed approach by construction.
There are also more memory types coming (episodic, graph-backed semantic), so ideally the evaluation framework would be extensible to new types rather than hardcoded for the current two.
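For the typed-memory question, one framing that at least avoids hardcoding the current two types is a row-normalized confusion matrix over an arbitrary label set: new types (episodic, graph-backed) just extend the list, and the single-index baseline can be scored on the same matrix. A minimal sketch, assuming queries and retrieved items can be labeled with an intended type (that labeling scheme is my assumption, not something in the source):

```python
def type_confusion_matrix(labeled_results, memory_types):
    """labeled_results: (query_type, retrieved_type) pairs.
    Returns matrix[q][r] = P(retrieved item has type r | query has type q),
    row-normalized. Off-diagonal mass is the type confusion; the diagonal
    is type-consistent retrieval. Extends to any set of memory types."""
    counts = {q: {r: 0 for r in memory_types} for q in memory_types}
    for q, r in labeled_results:
        counts[q][r] += 1
    for q in memory_types:
        total = sum(counts[q].values()) or 1  # avoid division by zero
        counts[q] = {r: counts[q][r] / total for r in memory_types}
    return counts
```

The construction-bias worry still applies to the diagonal, but the full matrix also shows *which* types get confused with which, which is informative even if the typed system wins by design.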
I've been looking at adapting LongMemEval or LoCoMo with a multi-agent twist (mapping separate speakers to separate vaults and testing cross-contamination under ambiguous queries) but haven't found a clean setup yet.
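The harness I have in mind for that adaptation looks roughly like this: group utterances by speaker into per-speaker vaults, then check which vaults an ambiguous query matches; isolation holds only if hits stay confined to the target speaker's vault. The dataset schema and helper names below are made up for illustration (a real version would use LoCoMo's session structure and actual retrieval, not substring matching):

```python
def build_vaults(sessions):
    """sessions: (speaker, utterance) pairs -> {speaker: [utterances]}.
    Each speaker becomes a separate vault."""
    vaults = {}
    for speaker, utterance in sessions:
        vaults.setdefault(speaker, []).append(utterance)
    return vaults

def leakage_probe(vaults, target_speaker, query_matches):
    """query_matches: predicate standing in for retrieval relevance.
    Returns (isolated?, set of speakers whose vaults matched).
    An ambiguous query matching multiple vaults shows why the system
    must rely on isolation, not semantics, to keep results confined."""
    hits = {s for s, docs in vaults.items() if any(map(query_matches, docs))}
    return hits == {target_speaker}, hits
```

The interesting test cases are exactly the ambiguous ones, where the same query is semantically relevant to several speakers' memories, so any cross-vault hit is attributable to the isolation mechanism rather than embedding noise.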
Has anyone dealt with benchmarking architectural properties of memory systems rather than just retrieval quality? Interested in both methodology and pointers to relevant papers. The goal is something with scientific validity that could go into a paper, not just internal testing.