EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Abstract

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%, respectively. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
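
To make the idea of a patch-based memory more concrete, the following is a minimal sketch of one way a memory could record structured update histories instead of overwriting values. It is not EvoMem's actual implementation: the `Patch` and `PatchMemory` classes, their fields, and the example key are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Patch:
    """One recorded change to a memory entry (hypothetical schema)."""
    step: int              # index of the environment update that caused the change
    old: Optional[str]     # previous value, None if the key was not set before
    new: Optional[str]     # new value, None if the key was removed


@dataclass
class PatchMemory:
    """Keeps the update history per key instead of overwriting values."""
    history: Dict[str, List[Patch]] = field(default_factory=dict)

    def write(self, key: str, value: Optional[str], step: int) -> None:
        old = self.current(key)
        if old == value:
            return  # nothing changed, so no patch is recorded
        self.history.setdefault(key, []).append(Patch(step, old, value))

    def current(self, key: str) -> Optional[str]:
        """Latest value for a key, or None if it was never set."""
        patches = self.history.get(key)
        return patches[-1].new if patches else None

    def changes(self, key: str) -> List[str]:
        """Render the update history as text an agent could read."""
        return [f"step {p.step}: {p.old!r} -> {p.new!r}"
                for p in self.history.get(key, [])]


# Hypothetical example: a command-line flag is renamed by an environment update.
memory = PatchMemory()
memory.write("cli.list_flag", "--all", step=0)
memory.write("cli.list_flag", "--show-all", step=1)
```

In this sketch, `memory.current("cli.list_flag")` returns only the latest value, while `memory.changes("cli.list_flag")` exposes how the entry evolved, which is the kind of information an agent would need to reason about an environment change.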
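
The difference between per-task accuracy and chain-level accuracy can be illustrated with a short sketch. It assumes, following the description above, that a chain counts as solved only if every subtask in it succeeds; the function names and the toy results are hypothetical and are not taken from the benchmark.

```python
from typing import List


def task_accuracy(chains: List[List[bool]]) -> float:
    """Fraction of individual subtasks solved, pooled over all chains."""
    outcomes = [ok for chain in chains for ok in chain]
    return sum(outcomes) / len(outcomes)


def chain_accuracy(chains: List[List[bool]]) -> float:
    """Fraction of chains in which every subtask was solved."""
    return sum(all(chain) for chain in chains) / len(chains)


# Toy outcomes: each inner list is one chain of consecutive subtasks.
results = [
    [True, True, True],
    [True, False, True],
    [True, True],
    [False, True],
]
```

On these toy outcomes, 8 of 10 subtasks succeed but only 2 of 4 chains are fully solved, so chain-level accuracy is the stricter of the two metrics.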