SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Abstract

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, so that correct assistance depends on the relations between memories rather than on isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and use such relations in downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover the distributed relational structure when answering later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak at fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across the memory preservation, retrieval, and downstream reasoning stages.
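To make the idea of a relation-controlled memory-variant set concrete, the following is a minimal sketch and not the benchmark's actual data format or construction procedure. The names `Relation`, `VariantSet`, `embed_variants`, and `recovered_all` are hypothetical, and the even spacing of variants through the history is an illustrative assumption only.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """The three relation types named in the abstract."""
    COMPLEMENTARY = "complementary"
    NUANCED = "nuanced"
    CONTRADICTORY = "contradictory"


@dataclass(frozen=True)
class VariantSet:
    """Hypothetical container: one latent artifact and its related variants."""
    artifact_id: str
    relation: Relation
    variants: tuple[str, ...]


def embed_variants(history: list[str], vset: VariantSet) -> list[str]:
    """Spread the variants through a copy of the history.

    Even spacing is an assumption made for illustration; the paper's
    embedding procedure is not described in the abstract.
    """
    result = list(history)
    n = len(vset.variants)
    # Insert from the back so earlier insertion points stay valid.
    for i in reversed(range(n)):
        position = (i + 1) * len(history) // (n + 1)
        result.insert(position, vset.variants[i])
    return result


def recovered_all(retrieved: list[str], vset: VariantSet) -> bool:
    """True only if every variant of the set appears in the retrieved items."""
    return set(vset.variants) <= set(retrieved)
```

In this sketch, a memory system that retrieves only one variant of a contradictory pair fails `recovered_all`, which mirrors the abstract's point that isolated recall is not enough when the answer depends on how memories relate.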