Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
November 17, 2025
Authors: Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, Jinjie Gu
cs.AI
Abstract
Multi-agent systems perform well on general reasoning tasks, but the lack of training in specialized domains limits their accuracy. Existing training methods train a single unified large language model (LLM) shared by all agents in the system, which can limit performance because different agents operate over different underlying data distributions. The natural next step is therefore to train multi-agent systems in which agents are backed by distinct LLMs. This approach, however, introduces new optimization challenges: agents operate at different frequencies, rollouts involve varying numbers of sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both the main agent and the sub-agents, maintaining hierarchical credit assignment, and introduces a trajectory-alignment scheme that produces fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange only minimal statistics through a shared store, enabling scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents improve tool-augmented reasoning.
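The two mechanisms named in the abstract can be illustrated with a minimal sketch: a GRPO-style group-relative advantage (here, a z-score within a rollout group) and one possible trajectory-alignment rule that forces every rollout to contribute a fixed number k of sub-agent rewards. The truncate/resample alignment rule, the joint normalization of sub-agent rewards, and the helper names (group_relative_advantage, align_sub_trajectories, k) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its group's mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def align_sub_trajectories(sub_rewards_per_rollout, k):
    """Force every rollout to contribute exactly k sub-agent rewards so batches
    have a fixed shape (G, k) despite variable invocation counts.
    Assumed rule: truncate extras, pad shortfalls by resampling with replacement.
    (Rollouts with zero sub-agent calls are not handled in this toy sketch.)"""
    aligned = []
    for rewards in sub_rewards_per_rollout:
        if len(rewards) >= k:
            aligned.append(list(rewards[:k]))
        else:
            pad = [rewards[i % len(rewards)] for i in range(k - len(rewards))]
            aligned.append(list(rewards) + pad)
    return np.asarray(aligned, dtype=np.float64)

# Toy example: G = 4 main-agent rollouts of the same query.
main_rewards = [1.0, 0.0, 0.5, 1.0]
main_adv = group_relative_advantage(main_rewards)            # shape (4,)

# Each rollout triggered a different number of sub-agent (tool-executor) calls.
sub_rewards = [[0.8, 0.2], [0.4], [0.9, 0.7, 0.1], [0.5, 0.6]]
sub_batch = align_sub_trajectories(sub_rewards, k=2)          # shape (4, 2)
sub_adv = group_relative_advantage(sub_batch.ravel()).reshape(sub_batch.shape)
```

Because the main-agent and sub-agent advantages are computed independently from their own reward statistics, each policy can be updated on its own server; only scalar rewards or advantages need to pass through a shared store, consistent with the decoupled pipeline described above.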