STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

May 7, 2026
Authors: Hanxiang Chao, Yihan Bai, Rui Sheng, Tianle Li, Yushi Sun
cs.AI

Abstract

Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.
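
To make the three-dimensional probing framework concrete, the sketch below shows one way a conflict scenario and its three queries could be represented and scored in Python. It is not the STALE data format or the authors' evaluation code: the class names, field names, and the example scenario are hypothetical, and computing overall accuracy as the fraction of all queries answered correctly is an assumption, since the abstract does not say how the 55.2% figure is aggregated.

```python
from dataclasses import dataclass
from enum import Enum


class Probe(Enum):
    """The three probing dimensions named in the abstract."""
    STATE_RESOLUTION = "state_resolution"
    PREMISE_RESISTANCE = "premise_resistance"
    IMPLICIT_POLICY_ADAPTATION = "implicit_policy_adaptation"


@dataclass(frozen=True)
class Scenario:
    """Hypothetical layout of one implicit-conflict scenario."""
    earlier_memory: str      # belief stored earlier in the context
    later_observation: str   # invalidates it without explicit negation
    queries: dict            # maps each Probe to one query string


@dataclass(frozen=True)
class Result:
    """One graded answer; grading itself is outside this sketch."""
    scenario_id: int
    probe: Probe
    correct: bool


def accuracy_by_probe(results):
    """Return (per-probe accuracy, overall accuracy over all queries)."""
    totals = {p: [0, 0] for p in Probe}  # probe -> [correct, seen]
    for r in results:
        totals[r.probe][0] += int(r.correct)
        totals[r.probe][1] += 1
    per_probe = {p: (c / n if n else 0.0) for p, (c, n) in totals.items()}
    seen = sum(n for _, n in totals.values())
    correct = sum(c for c, _ in totals.values())
    return per_probe, (correct / seen if seen else 0.0)


# Illustrative scenario (invented for this sketch, not from the benchmark).
example = Scenario(
    earlier_memory="The user cycles to work every morning.",
    later_observation="The user mentions their leg will be in a cast for six weeks.",
    queries={
        Probe.STATE_RESOLUTION: "How does the user currently get to work?",
        Probe.PREMISE_RESISTANCE: "Which route should the user cycle tomorrow?",
        Probe.IMPLICIT_POLICY_ADAPTATION: "Plan the user's commute for Monday.",
    },
)
```

With one query per probe, 400 scenarios yield the 1,200 evaluation queries the abstract reports. Keeping per-probe scores separate from the overall score makes the gap the abstract describes visible: a model can do well on State Resolution while still accepting the stale premise or failing to adapt its plan.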