STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

Abstract

Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval and overlook the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict, in which a later observation invalidates an earlier memory without explicitly negating it, so that detecting the conflict requires contextual inference and commonsense reasoning. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics, with contexts of up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it: even the best-performing model evaluated achieves only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.
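To make the three-dimensional probing framework concrete, the following is a minimal sketch of how one conflict scenario and its three probes might be represented and scored. Everything in it is an assumption made for illustration: the `Probe`, `Scenario`, and `accuracy_by_probe` names, the field layout, the example scenario text, and the choice of a simple micro-averaged overall accuracy are all hypothetical and are not taken from the STALE release.

```python
from dataclasses import dataclass
from enum import Enum


class Probe(Enum):
    """The three probing dimensions named in the abstract."""
    STATE_RESOLUTION = "state_resolution"
    PREMISE_RESISTANCE = "premise_resistance"
    IMPLICIT_POLICY_ADAPTATION = "implicit_policy_adaptation"


@dataclass(frozen=True)
class Scenario:
    """Hypothetical schema: one implicit-conflict scenario with one query per probe."""
    scenario_id: str
    earlier_memory: str      # belief stored earlier in the context
    later_observation: str   # evidence that invalidates it without explicit negation
    queries: dict            # maps Probe -> query text


def accuracy_by_probe(results):
    """Aggregate (Probe, is_correct) pairs into per-probe and overall accuracy.

    Overall accuracy here is the fraction of all queries answered correctly;
    this aggregation rule is an assumption of the sketch.
    """
    totals = {p: 0 for p in Probe}
    correct = {p: 0 for p in Probe}
    for probe, is_correct in results:
        totals[probe] += 1
        correct[probe] += int(is_correct)
    report = {p.value: correct[p] / totals[p] for p in Probe if totals[p]}
    n = sum(totals.values())
    report["overall"] = sum(correct.values()) / n if n else 0.0
    return report


# Invented example scenario (not from the benchmark).
example = Scenario(
    scenario_id="demo-001",
    earlier_memory="The user cycles to work every day.",
    later_observation="The user mentions being on crutches for the next two months.",
    queries={
        Probe.STATE_RESOLUTION: "How does the user currently get to work?",
        Probe.PREMISE_RESISTANCE: "Which bike lock suits the user's daily ride this week?",
        Probe.IMPLICIT_POLICY_ADAPTATION: "Plan the user's commute for tomorrow.",
    },
)

# Invented grading outcome: only the State Resolution probe is answered correctly.
report = accuracy_by_probe([
    (Probe.STATE_RESOLUTION, True),
    (Probe.PREMISE_RESISTANCE, False),
    (Probe.IMPLICIT_POLICY_ADAPTATION, False),
])
print(report)
```

With one query per probe, 400 scenarios yield the 1,200 evaluation queries reported above. Keeping the probe as an explicit field lets a model that detects the stale state be distinguished from one that also resists a stale premise or adapts its downstream behavior.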