

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

May 12, 2026
Authors: Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, Kai-Wei Chang
cs.AI

Abstract

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with interaction histories containing up to 500 trajectories and 115M tokens. We use a context-gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding-agent-based methods incur high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
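The context-gathering formulation described above (a memory system ingests history trajectories offline, then returns compact evidence that a downstream QA model answers from) can be sketched as a minimal interface. All names below (`Trajectory`, `MemorySystem`, `KeywordMemory`, `answer`) are hypothetical illustrations, and the retriever is a toy keyword-overlap stand-in, not AgentRunbook-R or the benchmark's actual API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Trajectory:
    """One historical agent run in the target web environment."""
    steps: list[str]  # serialized observations and actions

class MemorySystem(Protocol):
    def ingest(self, history: list[Trajectory]) -> None:
        """Process history trajectories offline (up to hundreds of runs)."""
    def gather(self, question: str) -> str:
        """Return compact evidence relevant to the question."""

class KeywordMemory:
    """Toy stand-in: stores raw steps and retrieves by keyword overlap."""
    def __init__(self) -> None:
        self.steps: list[str] = []

    def ingest(self, history: list[Trajectory]) -> None:
        for traj in history:
            self.steps.extend(traj.steps)

    def gather(self, question: str) -> str:
        q = set(question.lower().split())
        scored = sorted(self.steps,
                        key=lambda s: len(q & set(s.lower().split())),
                        reverse=True)
        return "\n".join(scored[:3])  # enforce a compact evidence budget

def answer(question: str, evidence: str) -> str:
    """Placeholder for the downstream QA model, which sees only the
    question plus gathered evidence, never the full 115M-token history."""
    return f"Q: {question}\nEvidence:\n{evidence}"

history = [Trajectory(steps=[
    "clicked Export button on the orders page",
    "export failed: orders page requires date filter first",
    "set date filter then export succeeded",
])]
mem = KeywordMemory()
mem.ingest(history)
print(answer("What must be set before exporting orders?",
             mem.gather("export orders date filter")))
```

The key design point the benchmark evaluates is the `gather` step: the memory system must compress long histories into evidence small enough for a QA model's context, which is where the RAG-based and coding-agent-based methods differ.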