SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Running AI Agents

Abstract

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these collections grow, individual memories may reinforce one another, diverge across contexts, or directly conflict, so correct assistance depends on the relations among memories rather than on isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and use such relations in downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover the distributed relational structure when answering later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning both user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak at fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across the memory preservation, retrieval, and downstream reasoning stages.
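To make the construction concrete, the following is a minimal illustrative sketch of how a relation-controlled memory-variant set might be represented and scattered across a user-agent history. It is not the benchmark's actual data format or pipeline: the names `Relation`, `MemoryVariantSet`, and `embed_variants`, and the idea of placing variants by turn index, are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    # The three relation types named in the abstract.
    COMPLEMENTARY = "complementary"
    NUANCED = "nuanced"
    CONTRADICTORY = "contradictory"


@dataclass
class MemoryVariantSet:
    """Hypothetical container: variants of one artifact plus the relation among them."""
    artifact_id: str
    relation: Relation
    variants: list[str] = field(default_factory=list)


def embed_variants(history: list[str], variant_set: MemoryVariantSet,
                   positions: list[int]) -> list[str]:
    """Return a copy of `history` with each variant inserted at a turn index.

    Positions refer to indices in the original history, so the variants end
    up distributed across the conversation rather than sitting side by side.
    """
    if len(positions) != len(variant_set.variants):
        raise ValueError("need exactly one position per variant")
    out = list(history)
    # Insert from the largest index down so earlier inserts do not shift later ones.
    for pos, text in sorted(zip(positions, variant_set.variants), reverse=True):
        out.insert(pos, text)
    return out
```

Under these assumptions, an agent evaluated on the resulting history would need to link the scattered variants and respect their relation (for example, noticing that two statements contradict each other) when a later query touches the same artifact.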