In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
October 7, 2025
Authors: Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu
cs.AI
Abstract
Outcome-driven reinforcement learning has advanced reasoning in large
language models (LLMs), but prevailing tool-augmented approaches train a
single, monolithic policy that interleaves thoughts and tool calls under full
context; this scales poorly with long horizons and diverse tools and
generalizes weakly to new scenarios. Agentic systems offer a promising
alternative by decomposing work across specialized modules, yet most remain
training-free or rely on offline training decoupled from the live dynamics of
multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow
agentic framework that coordinates four modules (planner, executor, verifier,
generator) through an evolving memory and directly optimizes its planner inside
the multi-turn loop. To train on-policy in live environments, we propose
Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles
long-horizon, sparse-reward credit assignment by converting multi-turn
optimization into a sequence of tractable single-turn policy updates. It
broadcasts a single, verifiable trajectory-level outcome to every turn to align
local planner decisions with global success and stabilizes learning with
group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale
backbone outperforms top-performing baselines with average accuracy gains of
14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on
scientific tasks, even surpassing larger proprietary models like GPT-4o.
Further analyses confirm the benefits of in-the-flow optimization, showing
improved planning, enhanced tool-calling reliability, and positive scaling with
model size and reasoning turns.
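
The abstract describes the AgentFlow loop only at a high level. The sketch below is an illustrative reading of that description, not the authors' implementation: a planner, executor, verifier, and generator coordinate through an evolving memory across turns, with the planner being the module optimized in the flow. All function names and signatures here are hypothetical placeholders.

```python
# Minimal sketch (assumptions, not the paper's code) of the in-the-flow loop
# described in the abstract: four modules coordinated through an evolving memory.
from typing import Callable

def agentflow_loop(
    query: str,
    plan: Callable[[str, list[str]], str],       # planner: picks the next sub-goal / tool call
    execute: Callable[[str], str],               # executor: runs the chosen tool
    verify: Callable[[str, list[str]], bool],    # verifier: checks whether evidence suffices
    generate: Callable[[str, list[str]], str],   # generator: writes the final answer
    max_turns: int = 8,
) -> str:
    memory: list[str] = []                       # evolving memory shared across turns
    for _ in range(max_turns):
        action = plan(query, memory)             # only the planner is trained in the multi-turn loop
        observation = execute(action)
        memory.append(f"{action} -> {observation}")
        if verify(query, memory):                # stop early once the verifier is satisfied
            break
    return generate(query, memory)
```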
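Similarly, the Flow-GRPO credit-assignment scheme is only summarized in the abstract; a minimal sketch of what it implies is given below under stated assumptions: each rollout in a group receives a single verifiable trajectory-level reward (taken here as 0/1), rewards are group-normalized, and the resulting advantage is broadcast to every turn so each planner decision can be updated as a single-turn policy step. The `Rollout` structure and field names are illustrative.

```python
# Minimal sketch (not the authors' code) of the group-normalized, broadcast
# advantage computation suggested by the Flow-GRPO description in the abstract.
from dataclasses import dataclass
import statistics

@dataclass
class Rollout:
    turns: list[str]   # planner actions taken at each turn of the multi-turn loop
    reward: float      # single verifiable outcome reward for the whole trajectory (e.g. 0 or 1)

def flow_grpo_advantages(group: list[Rollout], eps: float = 1e-6) -> list[list[float]]:
    """Group-normalize trajectory rewards and broadcast them to every turn."""
    rewards = [r.reward for r in group]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    advantages = []
    for r in group:
        adv = (r.reward - mean) / (std + eps)
        # Every turn in the trajectory shares the same advantage, turning the
        # long-horizon, sparse-reward problem into tractable per-turn updates.
        advantages.append([adv] * len(r.turns))
    return advantages

# Example: a group of 4 rollouts for one query, two of which succeed.
group = [
    Rollout(turns=["plan-a", "plan-b"], reward=1.0),
    Rollout(turns=["plan-c"], reward=0.0),
    Rollout(turns=["plan-d", "plan-e", "plan-f"], reward=1.0),
    Rollout(turns=["plan-g"], reward=0.0),
]
print(flow_grpo_advantages(group))
```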