LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Abstract

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Each question is paired with a history of up to 500 trajectories and 115M tokens. We use a context-gathering formulation: memory systems consume the history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite these gains, coding-agent-based methods incur high latency. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
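
The context-gathering formulation can be pictured as a two-method interface: a memory system first ingests the history trajectories, then returns a small set of evidence snippets for a given question. The sketch below is illustrative only. `Trajectory`, `MemorySystem`, `KeywordMemory`, and the `budget` parameter are hypothetical names that do not come from the paper, and the keyword-overlap ranking is a toy stand-in, not AgentRunbook-R or AgentRunbook-C.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Trajectory:
    """One recorded agent episode (hypothetical schema)."""
    task: str
    steps: list[str]


class MemorySystem(Protocol):
    """Assumed interface: consume history, then return compact evidence."""

    def ingest(self, history: list[Trajectory]) -> None: ...

    def gather(self, question: str, budget: int) -> list[str]: ...


@dataclass
class KeywordMemory:
    """Toy memory: ranks stored steps by word overlap with the question."""
    steps: list[str] = field(default_factory=list)

    def ingest(self, history: list[Trajectory]) -> None:
        # Flatten every trajectory into one pool of step descriptions.
        for trajectory in history:
            self.steps.extend(trajectory.steps)

    def gather(self, question: str, budget: int) -> list[str]:
        # Score each step by the number of distinct words shared with the question.
        query = set(question.lower().split())
        scored = [(len(query & set(s.lower().split())), s) for s in self.steps]
        # Drop steps with no overlap; the stable sort keeps insertion order on ties.
        ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: -p[0])
        return [step for _, step in ranked[:budget]]
```

In this picture, a downstream reader model would receive only the output of `gather` together with the question, so both accuracy and latency depend on what the memory system chooses to return.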

