EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Abstract

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%, respectively. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
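
The distinction between per-task accuracy and chain-level accuracy can be illustrated with a minimal sketch. It assumes only what the abstract states: a chain counts as successful when every subtask in its consecutive sequence succeeds. The function names, the dictionary layout, and the toy outcomes below are hypothetical and are not taken from the EvoArena code or data.

```python
from typing import Dict, List


def task_accuracy(chains: Dict[str, List[bool]]) -> float:
    """Fraction of individual subtasks solved, pooled over all chains."""
    outcomes = [ok for steps in chains.values() for ok in steps]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def chain_accuracy(chains: Dict[str, List[bool]]) -> float:
    """Fraction of chains in which every consecutive subtask was solved.

    Empty chains are skipped so that all([]) == True cannot inflate the score.
    """
    non_empty = [steps for steps in chains.values() if steps]
    if not non_empty:
        return 0.0
    return sum(all(steps) for steps in non_empty) / len(non_empty)


# Hypothetical outcomes: one chain per domain, one boolean per evolution step.
toy_chains = {
    "terminal": [True, True, True],
    "software": [True, False, True],
    "social": [True, True],
}

print(f"task accuracy:  {task_accuracy(toy_chains):.3f}")
print(f"chain accuracy: {chain_accuracy(toy_chains):.3f}")
```

In this toy case a single failed step lowers task accuracy only slightly but fails the whole "software" chain, which is why the chain-level measure is the stricter of the two.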