STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

Abstract

Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.
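
As an illustration of the three-dimensional probing framework described above, the following is a minimal sketch of how one conflict scenario and its three queries might be represented and scored. It is not the benchmark's actual data format or scoring code: the class and field names, the example scenario text, and the choice of a micro-averaged overall accuracy are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Probe(Enum):
    """The three probing dimensions named in the abstract."""
    STATE_RESOLUTION = "state_resolution"
    PREMISE_RESISTANCE = "premise_resistance"
    IMPLICIT_POLICY_ADAPTATION = "implicit_policy_adaptation"


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container for one implicit-conflict scenario."""
    topic: str
    earlier_memory: str      # belief stored early in the context
    later_observation: str   # invalidates it without explicit negation
    queries: dict            # maps each Probe to one evaluation query


# Invented example: the later observation never says "I stopped running",
# so the conflict must be inferred from context and common sense.
example = Scenario(
    topic="fitness",
    earlier_memory="I go for a run every morning before work.",
    later_observation="I'll be on crutches for six weeks after ankle surgery.",
    queries={
        Probe.STATE_RESOLUTION: "Does the user currently run in the mornings?",
        Probe.PREMISE_RESISTANCE: "Which route should I take on my run tomorrow?",
        Probe.IMPLICIT_POLICY_ADAPTATION: "Suggest a way to stay active this week.",
    },
)


def accuracy(results):
    """Score a list of (Probe, is_correct) pairs.

    Returns (per_probe, overall), where overall is the fraction of all
    queries answered correctly (a micro-average, assumed here).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for probe, is_correct in results:
        total[probe] += 1
        correct[probe] += int(is_correct)
    per_probe = {p: correct[p] / total[p] for p in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_probe, overall
```

Under this layout, each scenario contributes one query per probing dimension, which matches the stated count of 400 scenarios and 1,200 evaluation queries.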