AgentFold: Long-Horizon Web Agents with Proactive Context Management

Abstract

LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that rigidly summarize the full history at every step risk the irreversible loss of critical details. To address these challenges, we introduce AgentFold, a novel agent paradigm centered on proactive context management and inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a "folding" operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents such as OpenAI's o4-mini.
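To make the two scales of folding concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes the context is kept as an ordered list of summary blocks, each covering a contiguous range of steps; the names `Block` and `fold` are hypothetical, and in AgentFold the model itself decides which range to fold and writes the summary, whereas here both are supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A summary covering steps first..last (inclusive) of the trajectory."""
    first: int
    last: int
    text: str


def fold(blocks: list[Block], first: int, last: int, summary: str) -> list[Block]:
    """Replace every block lying inside [first, last] with one summary block.

    Hypothetical helper: the range and the summary text are chosen by the
    caller, standing in for the decisions a trained agent would make.
    """
    inside = [b for b in blocks if first <= b.first and b.last <= last]
    if not inside:
        raise ValueError("fold range covers no existing block")
    # A block that overlaps the range without being contained in it cannot be
    # split, so the fold is rejected rather than silently dropping content.
    for b in blocks:
        if b not in inside and b.first <= last and b.last >= first:
            raise ValueError("fold range cuts through an existing block")
    before = [b for b in blocks if b.last < first]
    after = [b for b in blocks if b.first > last]
    return before + [Block(first, last, summary)] + after


# Five raw steps, one block each.
context = [Block(i, i, f"raw observation of step {i}") for i in range(1, 6)]

# Granular condensation: compress only the latest step, keeping its key detail.
context = fold(context, 5, 5, "step 5: located the official source page")

# Deep consolidation: collapse a finished multi-step sub-task into one block.
context = fold(context, 1, 4, "steps 1-4: candidate sources searched, none confirmed")
```

After both folds the context holds two blocks instead of five. The same operation covers both cases: a single-step range gives a granular condensation, and a multi-step range gives a deep consolidation.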