Benchmarks

Numbers, with the methodology.

hippo's retrieval, measured on LongMemEval (ICLR 2025), the public 500-question memory benchmark. The harness, the data hash, and the per-question results are in the repo so you can rerun them.

Standard task: per-question haystack

Each question ships its own haystack of about 48 conversation sessions, and the job is to retrieve the session that holds the answer. This is the standard LongMemEval-S setup, the same one published systems report. Recall at 1, 5, and 10 on the _s split, in percent:

Embedder	R@1	R@5	R@10
MiniLM-L6 (zero-dep default)	89.6	98.6	99.4
voyage-3-large (opt-in)	94.6	99.8	99.8

For reference, gbrain reports 97.6% R@5 on this split with a paid frontier embedder. hippo's free, local, zero-dependency default reaches 98.6%. On the standard task, retrieval recall is effectively saturated, so the embedder is a swappable part, not the differentiator.
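As a rough illustration of what the R@k columns mean, here is a minimal sketch of a recall-at-k computation. It is not the repo's harness: the function name, the input shapes, and the choice to count a question as a hit when any one of its answer sessions appears in the top k are all assumptions made for the example.

```python
from typing import Mapping, Sequence, Set


def recall_at_k(
    ranked: Mapping[str, Sequence[str]],
    gold: Mapping[str, Set[str]],
    k: int,
) -> float:
    """Percentage of questions with an answer session in the top k results.

    ranked: question id -> session ids, best match first (hypothetical shape).
    gold:   question id -> session ids that hold the answer (hypothetical shape).
    Assumption: a question is a hit if ANY gold session is in the top k.
    """
    if not gold:
        raise ValueError("gold must contain at least one question")
    hits = 0
    for question_id, answer_sessions in gold.items():
        top_k = ranked.get(question_id, [])[:k]
        if any(session_id in answer_sessions for session_id in top_k):
            hits += 1
    return 100.0 * hits / len(gold)
```

Under this definition R@k can only stay flat or rise as k grows, which matches the left-to-right trend in the table.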

Large store: one unified memory

Point retrieval at a single store of all 19,195 sessions, with no pre-scoped haystack, closer to how an agent's memory actually accumulates. Recall stops being free:

47.2 R@5 · MiniLM-L6 (zero-dep default)

56.4 R@5 · voyage-3-large (opt-in)

A stronger embedder helps here (47 to 56) but neither is usable on its own: the answer drowns among thousands of distractors. This is where the memory lifecycle earns its keep, by decaying, consolidating, and superseding so the effective store stays small. Measuring that is the next benchmark on the roadmap.

Tested, and local by default

74% R@5, BM25 only. Zero dependencies, no embeddings required at runtime.

926 tests, real database. Zero mocks. Project rule: no mocked dependencies.

0 outbound HTTP. Local SQLite, proven by a fetch spy on the ingestion smoke test.

Reproduce it

The data is longmemeval_s_cleaned.json (SHA-256 d6f21ea9...), 500 questions over 19,195 sessions. Retrieval is turn-level dense plus BM25, fused with reciprocal rank fusion and max-pooled to session. Embeddings are L2-normalized; the default is local MiniLM, with an opt-in pluggable provider for frontier embedders.
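The fusion and pooling steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in the repo: the function names, the input shapes, and the RRF constant k = 60 (a commonly used default) are all hypothetical here, and the harness may weight or break ties differently.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal rank fusion over several turn-level rankings.

    Each ranking lists turn ids, best first. A turn's fused score is the
    sum over rankings of 1 / (k + rank), with rank starting at 1.
    k = 60 is an assumed constant, not a value taken from the repo.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, turn_id in enumerate(ranking, start=1):
            scores[turn_id] += 1.0 / (k + rank)
    return dict(scores)


def max_pool_to_sessions(
    turn_scores: Mapping[str, float],
    turn_to_session: Mapping[str, str],
) -> List[Tuple[str, float]]:
    """Score each session by its best-scoring turn, then sort best first."""
    best: Dict[str, float] = {}
    for turn_id, score in turn_scores.items():
        session_id = turn_to_session[turn_id]
        if score > best.get(session_id, float("-inf")):
            best[session_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy example: one dense ranking and one BM25 ranking over four turns.
dense_ranking = ["t1", "t2", "t3"]
bm25_ranking = ["t3", "t1", "t4"]
turn_to_session = {"t1": "A", "t2": "A", "t3": "B", "t4": "C"}

fused = rrf_fuse([dense_ranking, bm25_ranking])
session_ranking = max_pool_to_sessions(fused, turn_to_session)
```

Because RRF uses only rank positions, the dense and BM25 scores never have to be put on a common scale. Max-pooling then lets a single strongly matching turn surface its whole session.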
