STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

Abstract

Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.
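To make the three probing dimensions concrete, the following is a minimal Python sketch of how one conflict scenario and its three probes might be represented and scored. It is an illustration under stated assumptions, not the benchmark's actual data format or scoring code: the names `Dimension`, `Scenario`, and `accuracy_report` are hypothetical, the example scenario is invented for illustration, and overall accuracy is assumed here to be the fraction of all queries answered correctly.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The three probing dimensions named in the abstract."""
    STATE_RESOLUTION = "state_resolution"
    PREMISE_RESISTANCE = "premise_resistance"
    IMPLICIT_POLICY_ADAPTATION = "implicit_policy_adaptation"


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one implicit-conflict scenario."""
    scenario_id: str
    earlier_memory: str      # belief stored earlier in the context
    later_observation: str   # later evidence that invalidates it implicitly
    probes: dict[Dimension, str]  # one evaluation query per dimension


def accuracy_report(results):
    """Aggregate (Dimension, is_correct) pairs into per-dimension accuracy
    and an overall score.

    Assumption: "overall" is the share of all queries judged correct;
    the abstract does not specify how the overall figure is aggregated.
    """
    totals = {d: 0 for d in Dimension}
    correct = {d: 0 for d in Dimension}
    for dimension, is_correct in results:
        totals[dimension] += 1
        correct[dimension] += int(is_correct)
    report = {d.value: correct[d] / totals[d] for d in Dimension if totals[d]}
    n_queries = sum(totals.values())
    report["overall"] = sum(correct.values()) / n_queries if n_queries else 0.0
    return report


# Invented example (not taken from the benchmark): the cast implies the
# jogging habit is stale, although the user never says "I stopped jogging".
example = Scenario(
    scenario_id="demo-001",
    earlier_memory="User jogs every morning before work.",
    later_observation="User mentions their leg will be in a cast for six weeks.",
    probes={
        Dimension.STATE_RESOLUTION: "Does the user currently go jogging in the morning?",
        Dimension.PREMISE_RESISTANCE: "Which route should I take on tomorrow's morning jog?",
        Dimension.IMPLICIT_POLICY_ADAPTATION: "Suggest some activities for my weekend.",
    },
)

if __name__ == "__main__":
    # Hypothetical judgments for the three probes of the example scenario.
    judged = [
        (Dimension.STATE_RESOLUTION, True),
        (Dimension.PREMISE_RESISTANCE, False),
        (Dimension.IMPLICIT_POLICY_ADAPTATION, False),
    ]
    print(accuracy_report(judged))
```

With one probe per dimension, 400 scenarios yield the 1,200 evaluation queries mentioned above. Reporting per-dimension accuracy alongside the overall figure keeps the distinction visible between detecting a stale belief and acting on the updated state.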