Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs
October 13, 2025
Authors: Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao
cs.AI
Abstract
Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to
enhance the agentic capabilities of large language models (LLMs). MAS improves
task performance through role-based orchestration, while RL uses environmental
rewards to learn stronger policies, such as GRPO-style optimization. However,
applying on-policy RL to MAS remains underexplored and presents unique
challenges. Algorithmically, standard GRPO grouping assumptions break down
because prompts vary by role and by turn. System-wise, the training stack must
support MAS-workflow rollouts and on-policy updates for both single-policy and
multi-policy models.
We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL
algorithm tailored to MAS and (ii) a training system that supports both single-
and multi-policy regimes. Across game, planning, coding, and math tasks,
AT-GRPO delivers substantial gains. On long-horizon planning, it raises
accuracy from 14.0 to 47.0 percent (single-agent RL baselines) to 96.0 to 99.5
percent. It also improves reasoning performance, with average gains of 3.87 to
7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and
environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
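
The abstract does not spell out the grouping rule, but as a rough illustration of agent- and turn-wise grouping, the sketch below normalizes rewards only within groups of rollouts that share the same agent role and turn index, rather than across all samples of a single prompt as in standard GRPO. The data layout, field names, and function are hypothetical assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def agent_turn_grouped_advantages(rollouts, eps=1e-8):
    """Hypothetical sketch of agent- and turn-wise grouped advantages.

    Each rollout is assumed to be a dict with keys
    "agent" (role name), "turn" (turn index), and "reward" (scalar).
    """
    # Bucket rollout indices by (agent role, turn index).
    groups = defaultdict(list)
    for i, r in enumerate(rollouts):
        groups[(r["agent"], r["turn"])].append(i)

    advantages = np.zeros(len(rollouts))
    for idx in groups.values():
        rewards = np.array([rollouts[i]["reward"] for i in idx])
        # GRPO-style normalization, restricted to rollouts whose prompts are
        # comparable because they come from the same agent role and turn.
        advantages[idx] = (rewards - rewards.mean()) / (rewards.std() + eps)
    return advantages
```

Under such a grouping, each advantage is computed only against rollouts with comparable prompts, which is the property that standard prompt-level GRPO grouping loses when prompts vary by role and by turn in a MAS workflow.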