Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems
February 9, 2026
Authors: Lang Feng, Longtao Zheng, Shuo He, Fuxiang Zhang, Bo An
cs.AI
Abstract
Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.
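To make the contrast between global and agent-wise normalization concrete, the following is a minimal sketch under simplifying assumptions: rewards are grouped by an `agent_ids` array, and the function names, toy values, and epsilon constant are illustrative rather than taken from the paper's implementation.

```python
import numpy as np

def global_advantages(rewards, eps=1e-8):
    """GRPO-style global baseline: normalize all rollout rewards in a group
    with one shared mean and std, regardless of which agent produced them."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def agent_wise_advantages(rewards, agent_ids, eps=1e-8):
    """Agent-wise remedy (as described in the abstract): normalize each reward
    against the mean/std of rewards from the same agent, so heterogeneous
    reward scales do not distort any single agent's gradient magnitude."""
    r = np.asarray(rewards, dtype=np.float64)
    ids = np.asarray(agent_ids)
    adv = np.empty_like(r)
    for a in np.unique(ids):
        mask = ids == a
        group = r[mask]
        adv[mask] = (group - group.mean()) / (group.std() + eps)
    return adv

# Toy example: agent 0's rewards sit on a very different scale than agent 1's.
rewards   = [0.9, 1.1, 1.0, 10.0, 30.0, 20.0]
agent_ids = [0,   0,   0,   1,    1,    1]

print(global_advantages(rewards))                 # agent 0's advantages all strongly negative
print(agent_wise_advantages(rewards, agent_ids))  # each agent's advantages are zero-mean
```

In the toy example, the global baseline assigns uniformly negative advantages to the low-reward-scale agent, which is the kind of miscalibration the abstract attributes to gradient-norm instability; the per-agent variant keeps each agent's advantages centered on its own statistics.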