On the LoCoMo benchmark, Memvid sets a new state of the art with an overall score of 85.7%, outperforming every benchmarked memory system and beating the baseline average by 35% (relative) on long-horizon conversational recall and reasoning.
Categories 1-4 Accuracy
Benchmark: LoCoMo · 10 conversations · ~26K tokens each · LLM-as-Judge evaluation
| Category | Memvid | Mem0 | Mem0ᵍ (graph) | Zep | OpenAI | Δ vs. baseline avg |
|---|---|---|---|---|---|---|
| Single-hop | 80.1% | 67.1% | 65.7% | 61.7% | 63.8% | +24% |
| Multi-hop | 80.4% | 51.1% | 47.2% | 41.4% | 42.9% | +76% |
| Temporal | 71.9% | 55.5% | 58.1% | 49.3% | 21.7% | +56% |
| World-knowledge | 91.1% | 72.9% | 75.7% | 76.6% | 62.3% | +27% |
| Adversarial | 77.8% | — | — | — | — | — |
| Overall (Cat. 1-4) | 85.65% | 66.88% | 68.44% | 65.99% | 52.90% | +35% |
Following standard methodology, the adversarial category is excluded from the primary metric.
Baseline figures are sourced from arXiv:2504.19413; results for some systems are disputed by their vendors (see the paper for methodology).
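For reference, the Δ column is Memvid's relative improvement over the mean of the four baseline scores. A minimal TypeScript sketch of that calculation for the overall row, using the figures from the table above (the formula is our reading of the column, not taken from the suite's code):

```typescript
// Illustrative only: derive the "Δ vs. baseline avg" column, assuming it is
// Memvid's relative improvement over the mean of the four baseline scores.
const memvidOverall = 85.65;
const baselineOverall = [66.88, 68.44, 65.99, 52.9]; // Mem0, Mem0ᵍ, Zep, OpenAI

const baselineAvg =
  baselineOverall.reduce((sum, score) => sum + score, 0) / baselineOverall.length;

const deltaVsAvg = (memvidOverall / baselineAvg - 1) * 100;

console.log(`Baseline average: ${baselineAvg.toFixed(2)}%`); // 63.55%
console.log(`Δ vs. baseline avg: +${Math.round(deltaVsAvg)}%`); // +35%
```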
Open source benchmark suite
Our benchmark implementation is fully open source. Run the complete evaluation suite yourself to verify the results:
```bash
bun run bench:full
```
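For context, the suite scores answers with an LLM judge. The sketch below shows how LLM-as-judge grading of LoCoMo-style QA pairs typically works; the judge model, prompt wording, and helper names are assumptions for illustration, not the suite's actual implementation.

```typescript
// Illustrative sketch of LLM-as-judge grading, not the bench:full implementation.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface QAItem {
  question: string;
  goldAnswer: string;
  modelAnswer: string;
}

// Ask a judge model whether the system's answer matches the gold answer.
async function judge(item: QAItem): Promise<boolean> {
  const prompt = [
    `Question: ${item.question}`,
    `Gold answer: ${item.goldAnswer}`,
    `Candidate answer: ${item.modelAnswer}`,
    `Does the candidate answer convey the same information as the gold answer?`,
    `Reply with exactly CORRECT or INCORRECT.`,
  ].join("\n");

  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // judge model choice is an assumption
    messages: [{ role: "user", content: prompt }],
    temperature: 0,
  });

  return response.choices[0].message.content?.trim().toUpperCase() === "CORRECT";
}

// Accuracy for one category = fraction of items the judge marks CORRECT.
async function categoryAccuracy(items: QAItem[]): Promise<number> {
  const verdicts = await Promise.all(items.map(judge));
  return (verdicts.filter(Boolean).length / items.length) * 100;
}
```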