
LMEB: Long-horizon Memory Embedding Benchmark

March 13, 2026
Authors: Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang
cs.AI

Abstract

Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities on complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across four memory types (episodic, dialogue, semantic, and procedural), with both AI-generated and human-annotated data. These memory types differ in their level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) larger models do not always perform better; and (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.
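As a rough illustration of the zero-shot retrieval setup the abstract describes, the sketch below ranks stored memory embeddings against a query embedding by cosine similarity and scores the ranking with recall@k. This is a minimal, generic sketch: the toy vectors stand in for a real embedding model's output, and the helper names (`rank_memories`, `recall_at_k`) are hypothetical, not LMEB's actual API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_memories(query_emb, memory_embs):
    """Return memory indices sorted by descending similarity to the query."""
    scores = [(cosine(query_emb, m), i) for i, m in enumerate(memory_embs)]
    return [i for _, i in sorted(scores, reverse=True)]

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant memory indices found in the top-k ranking."""
    hits = sum(1 for i in ranked[:k] if i in relevant)
    return hits / len(relevant)

# Toy embeddings standing in for a real model's query/memory encodings.
query = [0.9, 0.1, 0.0]
memories = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.0, 1.0]]

ranked = rank_memories(query, memories)
print(ranked)                      # memory 1 is closest to the query
print(recall_at_k(ranked, {1}, 1))
```

In a benchmark like LMEB, the same ranking-and-scoring loop would run over each dataset's query/memory pairs, with metrics such as nDCG or recall@k aggregated per memory type.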