Benchmarks
Numbers, with the methodology.
hippo's retrieval, measured on LongMemEval (ICLR 2025), the public 500-question memory benchmark. The harness, the data hash, and the per-question results are in the repo so you can rerun them.
Standard task: per-question haystack
Each question ships its own haystack of about 48 conversation sessions, and the job is to retrieve the session that holds the answer. This is the standard LongMemEval-S setup, the same one that published systems report on. Recall at 1, 5, and 10 (in percent), on the _s split:
| Embedder | R@1 | R@5 | R@10 |
|---|---|---|---|
| MiniLM-L6 (zero-dep default) | 89.6 | 98.6 | 99.4 |
| voyage-3-large (opt-in) | 94.6 | 99.8 | 99.8 |
For reference, gbrain reports 97.6% R@5 on this split with a paid frontier embedder. hippo's free, local, zero-dependency default reaches 98.6%. On the standard task, retrieval recall is effectively saturated, so the embedder is a swappable part, not the differentiator.
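For anyone checking the table against the per-question results, recall at k can be sketched as below. This is a minimal illustration, not the repo's harness: the function names are hypothetical, and it assumes a question counts as a hit when any of its answer sessions appears in the top k. The real harness may define a hit differently.

```python
from typing import Iterable, Sequence, Set, Tuple


def hit_at_k(ranked_sessions: Sequence[str], answer_sessions: Set[str], k: int) -> bool:
    """True if any answer session appears among the top-k retrieved sessions.

    Assumption: "any answer session" is the hit criterion; this is not
    confirmed by the source.
    """
    return any(sid in answer_sessions for sid in ranked_sessions[:k])


def recall_at_k(results: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Percentage of questions with a hit in the top k."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in results)
    return 100.0 * hits / len(results)


# Toy data: (ranked session ids, answer session ids) per question.
toy_results = [
    (["s3", "s1", "s2"], {"s1"}),  # answer at rank 2
    (["s9", "s8", "s7"], {"s7"}),  # answer at rank 3
]
```

With the toy data, `recall_at_k(toy_results, 1)` is 0.0, `recall_at_k(toy_results, 2)` is 50.0, and `recall_at_k(toy_results, 3)` is 100.0.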
Large store: one unified memory
Point retrieval at a single store of all 19,195 sessions, with no pre-scoped haystack, closer to how an agent's memory actually accumulates. Recall stops being free:
A stronger embedder helps here (47 to 56), but neither embedder is usable on its own: the answer drowns among thousands of distractors. This is where the memory lifecycle earns its keep, by decaying, consolidating, and superseding memories so the effective store stays small. Measuring that is the next benchmark on the roadmap.
Tested, and local by default
Reproduce it
The data is longmemeval_s_cleaned.json (SHA-256 d6f21ea9...), 500 questions over 19,195 sessions. Retrieval scores individual turns with dense embeddings and with BM25, fuses the two rankings with reciprocal rank fusion (RRF), and max-pools the fused turn scores to the session level. Embeddings are L2-normalized; the default is local MiniLM, with an opt-in pluggable provider for frontier embedders.
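The fusion and pooling steps described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not hippo's actual code: the function names and the toy turn and session ids are hypothetical, and the RRF constant `k=60` is the value commonly used in the literature, not one confirmed by the source.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per item.

    Assumption: k=60 is the conventional default, not taken from the repo.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, turn_id in enumerate(ranking, start=1):
            scores[turn_id] += 1.0 / (k + rank)
    return dict(scores)


def max_pool_to_sessions(
    turn_scores: Dict[str, float], turn_to_session: Dict[str, str]
) -> List[str]:
    """Score each session by its best turn, then rank sessions best first."""
    session_scores: Dict[str, float] = {}
    for turn_id, score in turn_scores.items():
        session_id = turn_to_session[turn_id]
        if score > session_scores.get(session_id, float("-inf")):
            session_scores[session_id] = score
    # Sort by descending score; break ties by session id for determinism.
    return sorted(session_scores, key=lambda s: (-session_scores[s], s))


# Toy example: two turn-level rankings (dense and BM25) over four turns.
dense_ranking = ["t1", "t2", "t3"]
bm25_ranking = ["t3", "t1", "t4"]
turn_to_session = {"t1": "A", "t2": "B", "t3": "A", "t4": "C"}

fused = rrf_fuse([dense_ranking, bm25_ranking])
ranked_sessions = max_pool_to_sessions(fused, turn_to_session)
```

Here session A ranks first because its best turn, `t1`, sits near the top of both lists. Max-pooling means a single strong turn is enough to surface a session, regardless of how many weak turns it also contains.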