Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
November 17, 2025
作者: Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, Jinjie Gu
cs.AI
Abstract
Multi-agent systems perform well on general reasoning tasks, but the lack of training in specialized domains limits their accuracy. Current training methods train a single unified large language model (LLM) for all agents in the system, which may cap performance because different agents operate over different underlying data distributions. Training multi-agent systems with distinct LLMs is therefore a natural next step, but it introduces new optimization challenges: agents operate at different frequencies, rollouts involve varying numbers of sub-agent invocations, and agents are often deployed across separate servers, which breaks end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both the main agent and the sub-agents, preserving hierarchical credit assignment, and introduces a trajectory-alignment scheme that produces fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange only minimal statistics via a shared store, enabling scalable training without cross-server backpropagation. On real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, with improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents strengthens tool-augmented reasoning.
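To make the two core mechanisms concrete, the sketch below illustrates (a) standard GRPO-style group-relative advantage normalization, which the abstract says is computed for both main and sub-agents, and (b) one plausible way to align a variable number of sub-agent trajectories into fixed-size batches. This is a minimal illustration in Python/NumPy; the function names, the fixed target count, and the pad-by-resampling strategy are assumptions for exposition, not the paper's exact alignment scheme.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each rollout's reward against the
    mean and standard deviation of its own rollout group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def align_subagent_trajectories(sub_trajs, target_count):
    """Hypothetical trajectory-alignment step: each main-agent rollout may
    invoke the sub-agent a different number of times, so we truncate or pad
    (here, by resampling existing trajectories) to a fixed count per rollout,
    yielding fixed-size sub-agent batches."""
    if not sub_trajs:
        return []
    if len(sub_trajs) >= target_count:
        return sub_trajs[:target_count]
    padding = [sub_trajs[i % len(sub_trajs)]
               for i in range(target_count - len(sub_trajs))]
    return sub_trajs + padding


# Example: a group of 4 main-agent rollouts for the same query, each with a
# final task reward and a variable number of sub-agent tool-call trajectories.
main_rewards = [1.0, 0.0, 1.0, 0.5]
sub_trajs_per_rollout = [
    ["search#1", "browse#1"],
    ["search#2"],
    ["search#3", "browse#2", "browse#3"],
    ["search#4", "browse#4"],
]

main_adv = group_relative_advantages(main_rewards)
aligned = [align_subagent_trajectories(t, target_count=2)
           for t in sub_trajs_per_rollout]
print(main_adv)   # per-rollout advantages for the main agent
print(aligned)    # fixed-size (2 per rollout) sub-agent trajectory batches
```

In the decoupled pipeline described in the abstract, only small statistics of this kind (e.g., group rewards or advantages) would need to cross the server boundary via the shared store, rather than gradients.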