# Memory benchmarks
Lobu’s memory system is benchmarked against external memory systems (Mem0, Supermemory, Letta, Zep) on public datasets. This page summarises the headline numbers and points at the reproducible harness.
## Headline results

Same answerer (glm-5.1 via z.ai), same top-K, same questions, three trials per public configuration.
### LongMemEval (oracle-50)

Single-session knowledge retention.
| System | Overall | Answer | Retrieval | Latency |
|---|---|---|---|---|
| Lobu | 87.1% | 78.0% | 100.0% | 237ms |
| Supermemory | 69.1% | 56.0% | 96.6% | 702ms |
| Mem0 | 65.7% | 54.0% | 85.3% | 753ms |
### LoCoMo-50

Multi-session conversational memory (each scenario is ~19 sessions of 18+ turns, then a question grounded in the dialogue).
| System | Overall | Answer | Retrieval | Latency |
|---|---|---|---|---|
| Lobu | 57.8% | 38.0% | 79.5% | 121ms |
| Mem0 | 41.5% | 28.0% | 66.9% | 606ms |
| Supermemory | 23.2% | 14.0% | 36.5% | 532ms |
## Methodology guardrails

The harness applies the following fairness constraints:

- Per-scenario isolation — every scenario runs in a fresh system state. Providers do not search across earlier scenarios from the same run.
- Multi-trial public runs — public full-QA configs default to three trials so reports show run-to-run variability.
- Uniform top-K — every adapter asks for exactly the configured `topK`. No silent overfetch.
- Per-system answerer token totals — leaderboards include answerer-side prompt and completion tokens so LLM cost is visible alongside accuracy.
- Parallel system execution — compare configs run systems in parallel (`Promise.allSettled`); one provider’s failure does not abort the others.
- Async ingest is waited out — for providers that index asynchronously (Zep’s `/graph-batch`), the adapter polls until the server reports the ingest processed.
- Raw metrics first — treat answer accuracy, retrieval recall, and citation quality as the primary comparison. The reported “overall” number is a secondary house score.
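The parallel-execution guardrail is easy to picture. The real runner is TypeScript and uses `Promise.allSettled`; the Python `asyncio` sketch below (with a made-up `run_system` stand-in, not part of the harness) shows the same settle-all semantics: one provider's rejection is captured as a result, and the other providers still complete.

```python
import asyncio


async def run_system(name: str) -> dict:
    # Hypothetical stand-in for one provider's benchmark run.
    if name == "flaky-provider":
        raise RuntimeError("provider API returned 500")
    return {"system": name, "overall": 0.5}


async def compare(systems: list[str]) -> list[dict]:
    # return_exceptions=True mirrors Promise.allSettled: a failure is
    # returned as a value instead of cancelling the sibling tasks.
    results = await asyncio.gather(
        *(run_system(s) for s in systems), return_exceptions=True
    )
    report = []
    for name, res in zip(systems, results):
        if isinstance(res, Exception):
            report.append({"system": name, "status": "rejected", "reason": str(res)})
        else:
            report.append({"status": "fulfilled", **res})
    return report


if __name__ == "__main__":
    for row in asyncio.run(compare(["lobu", "flaky-provider", "mem0"])):
        print(row)
```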
## Latency caveat

Latency is retrieval-only latency, not end-to-end wall clock. It is not fully apples-to-apples when one system is local/in-process and another is a hosted API. Lobu’s retrieval path is a multi-step plan (query expansion, entity search, content search, linked-context fetches) — that orchestration is what gets it to 100% retrieval recall on LongMemEval but also costs round trips. Mem0 and Supermemory adapters issue a single provider search per question.
## Reproducing the results

The full harness lives in the owletto repo under `benchmarks/memory/`. The TypeScript runner is at `src/benchmarks/memory/`. External systems are integrated as long-lived Python adapter subprocesses framed over JSONL-on-stdin, which avoids per-op fork/exec cost.
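The JSONL framing can be sketched like this. This is an illustrative Python sketch, not the actual TypeScript runner or the real protocol module; the `{"action": ..., "payload": ...}` field names are assumptions, and the inline echo adapter stands in for a real one.

```python
import json
import subprocess
import sys

# Tiny inline echo adapter so the sketch is self-contained. Real adapters
# speak the shared protocol from adapters/_bench_protocol.py; the message
# shape used here is a guess for illustration.
ECHO_ADAPTER = (
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    req = json.loads(line)\n"
    "    print(json.dumps({'ok': True, 'action': req['action']}))\n"
    "    sys.stdout.flush()\n"
)


def call_adapter(proc: subprocess.Popen, action: str, payload: dict) -> dict:
    # One request per line out, one response per line back. Keeping the
    # subprocess long-lived amortises fork/exec cost across operations.
    proc.stdin.write(json.dumps({"action": action, "payload": payload}) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())


if __name__ == "__main__":
    proc = subprocess.Popen(
        [sys.executable, "-c", ECHO_ADAPTER],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    print(call_adapter(proc, "retrieve", {"query": "who moved to Berlin?"}))
    proc.stdin.close()
    proc.wait()
```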
### Prerequisites

- Node.js 20+, pnpm 9+, Docker
- `ZAI_API_KEY` (z.ai, used as the answerer model `glm-5.1`)
- API keys for any external systems you want to include: `MEM0_API_KEY`, `SUPERMEMORY_API_KEY`, `LETTA_API_KEY`, `ZEP_API_KEY`
### LongMemEval oracle-50, all systems

```sh
ZAI_API_KEY=... MEM0_API_KEY=... SUPERMEMORY_API_KEY=... LETTA_API_KEY=... \
  pnpm benchmark:memory --config benchmarks/memory/config.longmemeval.oracle.50.compare.all.zai.json
```

### LoCoMo-50, three-way (Lobu vs Mem0 vs Supermemory)

```sh
ZAI_API_KEY=... MEM0_API_KEY=... SUPERMEMORY_API_KEY=... \
  pnpm benchmark:memory --config benchmarks/memory/config.locomo.50.compare.top-memory.zai.json
```

### Lobu-only, no external API keys

```sh
# Retrieval-only (no answerer)
pnpm benchmark:memory --config benchmarks/memory/config.longmemeval.oracle.50.json

# Full QA with z.ai answerer
ZAI_API_KEY=... pnpm benchmark:memory --config benchmarks/memory/config.longmemeval.oracle.50.zai.json
ZAI_API_KEY=... pnpm benchmark:memory --config benchmarks/memory/config.locomo.50.zai.json
```

### Smaller LoCoMo slices

```sh
pnpm benchmark:memory --config benchmarks/memory/config.locomo.5.local.json
pnpm benchmark:memory --config benchmarks/memory/config.locomo.10.compare.top-memory.zai.json
pnpm benchmark:memory --config benchmarks/memory/config.locomo.30.local.json
```

A complete table of available configs is documented in `benchmarks/memory/README.md`.
## GitHub Actions

The Memory Benchmark workflow runs the same harness in CI and uploads JSON + Markdown artifacts.

- Workflow: `benchmark-memory.yml`
- Trigger: Actions → Memory Benchmark → Run workflow

Inputs include `dataset` (`longmemeval-oracle` or `locomo`), `limit`, `trials`, `model` (answerer model id), and `providers` (comma-separated adapter list).
## Adapters

| System | Adapter | Notes |
|---|---|---|
| Mem0 | `adapters/mem0_adapter.py` | `MEM0_API_KEY` |
| Supermemory | `adapters/supermemory_adapter.py` | `SUPERMEMORY_API_KEY` |
| Letta | `adapters/letta_adapter.py` | `LETTA_API_KEY` |
| Zep | `adapters/zep_adapter.py` | `ZEP_API_KEY` (Cloud) or `ZEP_BASE_URL` (self-hosted) |
To add a new system, write a Python adapter that defines `reset` / `setup` / `ingest` / `retrieve` action handlers. The shared protocol module is at `adapters/_bench_protocol.py`.
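A new adapter skeleton might look like the sketch below. The four handler names come from the docs above, but the dispatch loop, payload fields, response shapes, and the placeholder substring retrieval are all assumptions; the real contract lives in `adapters/_bench_protocol.py`.

```python
import json
import sys


class MySystemAdapter:
    """Skeleton adapter; method names follow the documented actions."""

    def __init__(self) -> None:
        self.store: list[dict] = []

    def reset(self, payload: dict) -> dict:
        # Fresh state per scenario (the per-scenario isolation guardrail).
        self.store.clear()
        return {"ok": True}

    def setup(self, payload: dict) -> dict:
        return {"ok": True}

    def ingest(self, payload: dict) -> dict:
        self.store.append(payload)
        return {"ok": True, "count": len(self.store)}

    def retrieve(self, payload: dict) -> dict:
        # Naive substring match as a placeholder retrieval strategy;
        # honour the configured topK (no silent overfetch).
        q = payload.get("query", "").lower()
        hits = [m for m in self.store if q in m.get("text", "").lower()]
        return {"ok": True, "results": hits[: payload.get("topK", 10)]}


def serve(adapter: MySystemAdapter) -> None:
    # JSONL dispatch loop: one request per stdin line until EOF.
    for line in sys.stdin:
        req = json.loads(line)
        handler = getattr(adapter, req["action"])
        print(json.dumps(handler(req.get("payload", {}))))
        sys.stdout.flush()


if __name__ == "__main__":
    # Exercise the handlers directly; a real adapter would call serve().
    adapter = MySystemAdapter()
    adapter.ingest({"text": "Alice moved to Berlin"})
    print(adapter.retrieve({"query": "berlin", "topK": 5}))
```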
## Why Lobu wins on retention

Lobu blends three signals for recall:
- Entity name matching
- Full-text search
- Semantic vector search
On top of these signals, Lobu uses structured retrieval: knowledge is stored in entity types backed by JSON Schema, with first-class relationships and superseding writes. That combination is what pushes it to 100% retrieval recall on LongMemEval, where vector-only systems plateau in the 80–90% range.
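One way to picture the blend is as a weighted combination of the three normalised signals. The weights and the linear rule below are illustrative assumptions, not Lobu's actual scoring function; the point is only that an exact entity-name match can outrank a candidate that is merely semantically close.

```python
def blended_score(entity: float, fulltext: float, vector: float,
                  weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Combine three recall signals, each normalised to [0, 1].

    Illustrative only: the weights and the linear combination are
    assumptions, not Lobu's published scoring function.
    """
    w_e, w_f, w_v = weights
    return w_e * entity + w_f * fulltext + w_v * vector


# A candidate with an exact entity-name hit beats one that is only
# semantically close, which a vector-only ranker cannot express.
exact_entity = blended_score(entity=1.0, fulltext=0.6, vector=0.5)
vector_only = blended_score(entity=0.0, fulltext=0.2, vector=0.9)
assert exact_entity > vector_only
```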