LMEB: Long-horizon Memory Embedding Benchmark

Abstract

Memory embeddings are crucial for memory-augmented systems such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks. Existing benchmarks focus narrowly on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across four memory types (episodic, dialogue, semantic, and procedural), with both AI-generated and human-annotated data. These memory types differ in level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) larger models do not always perform better; and (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model that excels across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation and drives further advances in text embedding for long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.
