LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
May 12, 2026
Authors: Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, Kai-Wei Chang
cs.AI
Abstract
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Each question is paired with a history of up to 500 trajectories totaling 115M tokens. We use a context-gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding-agent-based methods incur high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
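To make the context-gathering formulation concrete, the sketch below shows one plausible shape of the interface the abstract describes: a memory system that ingests history trajectories and returns compact evidence for a downstream question-answering model. All class and method names (`MemorySystem`, `ingest`, `gather_evidence`) are illustrative assumptions, not from the paper, and the keyword-overlap retrieval stub stands in for a real RAG pipeline or coding agent.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # One recorded episode of agent-environment interaction
    # (observations, actions, outcomes), stored here as raw text.
    content: str


class MemorySystem:
    """Hypothetical interface for the context-gathering formulation:
    consume history trajectories, return compact evidence for a
    downstream QA model. Names are illustrative, not the paper's API."""

    def __init__(self) -> None:
        self.trajectories: list[Trajectory] = []

    def ingest(self, trajectory: Trajectory) -> None:
        self.trajectories.append(trajectory)

    def gather_evidence(self, question: str, budget_tokens: int = 2048) -> str:
        # A real system (e.g. AgentRunbook-R's knowledge pools or
        # AgentRunbook-C's coding agent) would do far more; this stub
        # keeps trajectories sharing at least one keyword with the question.
        keywords = set(question.lower().split())
        hits = [t.content for t in self.trajectories
                if keywords & set(t.content.lower().split())]
        # Crude character budget standing in for a token budget.
        return "\n".join(hits)[: budget_tokens * 4]


mem = MemorySystem()
mem.ingest(Trajectory("clicked checkout button, order form requires coupon field"))
mem.ingest(Trajectory("search page paginates after 20 results"))
evidence = mem.gather_evidence("where is the coupon field on the order form?")
```

The point of the interface is that the memory system's output is evaluated only through downstream question answering, so any internal representation (knowledge pools, files plus a sandboxed coding agent) is admissible as long as it yields compact, relevant evidence.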