SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Running AI Agents

Abstract

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, so that correct assistance depends on the relations among memories rather than on isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and use such relations in downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures in later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak at fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across the memory preservation, retrieval, and downstream reasoning stages.