SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Running AI Agents

Abstract

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.
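
As a rough illustration of the structure described above, the following Python sketch models one relation-controlled memory-variant set and an evaluation instance built on it. All class names, field names, and the sample data are hypothetical and are not taken from the benchmark's release; only the three relation types and the user-related versus non-user-related split come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Relation(Enum):
    # The three relation types named in the abstract.
    COMPLEMENTARY = "complementary"
    NUANCED = "nuanced"
    CONTRADICTORY = "contradictory"


@dataclass
class MemoryVariant:
    # Hypothetical fields: where in the long history the variant is
    # embedded, and the text the agent observes there.
    session_index: int
    text: str


@dataclass
class VariantSet:
    # Hypothetical container: one latent semantic artifact plus the
    # variants that instantiate a single controlled relation.
    artifact: str
    relation: Relation
    variants: List[MemoryVariant] = field(default_factory=list)


@dataclass
class EvalInstance:
    # Hypothetical evaluation instance: a later query grounded in a set.
    query: str
    user_related: bool
    variant_set: VariantSet


def required_sessions(instance: EvalInstance) -> List[int]:
    """Sessions an agent must draw on to recover the whole relation."""
    return sorted({v.session_index for v in instance.variant_set.variants})


# Invented sample data, for illustration only.
example = EvalInstance(
    query="Which seat should I book for the user's next flight?",
    user_related=True,
    variant_set=VariantSet(
        artifact="seat preference",
        relation=Relation.CONTRADICTORY,
        variants=[
            MemoryVariant(41, "User now prefers aisle seats."),
            MemoryVariant(3, "User prefers window seats."),
        ],
    ),
)
```

In this sketch, answering the query correctly requires both sessions returned by `required_sessions(example)`, since recalling either variant in isolation would hide the contradiction.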