

LMEB: Long-horizon Memory Embedding Benchmark

March 13, 2026
Authors: Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang
cs.AI

Abstract

Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities on complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across four memory types (episodic, dialogue, semantic, and procedural), with both AI-generated and human-annotated data. These memory types differ in level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) larger models do not always perform better; and (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance on traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advances in text embedding for long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.
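The zero-shot retrieval setup the abstract describes can be illustrated with a minimal sketch: embed a query and a pool of memory entries, rank entries by cosine similarity, and score the ranking with recall@k. This is a generic illustration of embedding-based retrieval evaluation, not LMEB's actual API; the function names and toy vectors below are hypothetical.

```python
# Hedged sketch of embedding-based memory retrieval evaluation
# (illustrative only; not LMEB's real interface or data).
import math

def cosine(a, b):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recall_at_k(query_vec, memory_vecs, relevant_ids, k):
    # Rank memory entries by similarity to the query embedding,
    # then measure the fraction of relevant entries in the top k.
    ranked = sorted(range(len(memory_vecs)),
                    key=lambda i: cosine(query_vec, memory_vecs[i]),
                    reverse=True)
    hits = sum(1 for i in ranked[:k] if i in relevant_ids)
    return hits / len(relevant_ids)

# Toy 3-dimensional embeddings: memory 0 is the relevant entry.
query = [1.0, 0.0, 0.0]
memories = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(recall_at_k(query, memories, {0}, k=1))  # → 1.0
```

In a real evaluation, the vectors would come from the embedding model under test, and metrics such as nDCG are commonly reported alongside recall.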