In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Abstract

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context. This design scales poorly with long horizons and diverse tools, and it generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. Flow-GRPO broadcasts a single, verifiable trajectory-level outcome to every turn, which aligns local planner decisions with global success, and it stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines, with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, and it even surpasses larger proprietary models such as GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, more reliable tool calling, and positive scaling with model size and reasoning turns.
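The credit-assignment idea in the abstract (one trajectory-level outcome broadcast to every turn, then normalized within a group of rollouts) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `broadcast_group_advantages`, the use of the population standard deviation, the `eps` stabilizer, and the binary rewards in the usage example are all assumptions made for illustration; the abstract does not give the exact formula.

```python
from statistics import mean, pstdev


def broadcast_group_advantages(outcomes, turns_per_rollout, eps=1e-8):
    """Hypothetical helper: per-turn advantages for a group of rollouts.

    outcomes          -- one scalar trajectory-level reward per rollout
    turns_per_rollout -- number of planner turns in each rollout
    eps               -- assumed stabilizer to avoid division by zero
    """
    if len(outcomes) != len(turns_per_rollout):
        raise ValueError("need one turn count per rollout")

    # Group statistics over all rollouts sampled for the same query.
    mu = mean(outcomes)
    sigma = pstdev(outcomes)  # assumption: population std, not sample std

    advantages = []
    for reward, n_turns in zip(outcomes, turns_per_rollout):
        # Group-normalized advantage of the whole trajectory.
        a = (reward - mu) / (sigma + eps)
        # Broadcast: every turn of this rollout receives the same value,
        # so each turn can be treated as a single-turn policy update.
        advantages.append([a] * n_turns)
    return advantages


# Usage: four rollouts for one query, with assumed binary success rewards.
group = broadcast_group_advantages(
    outcomes=[1.0, 0.0, 0.0, 1.0],
    turns_per_rollout=[3, 2, 4, 1],
)
```

In this sketch, successful rollouts receive a positive advantage on all of their turns and failed rollouts a negative one. When every rollout in the group has the same outcome, all advantages are zero and the group contributes no learning signal.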
