When Does Multi-Agent Reinforcement Learning Improve LLM Workflows? Tradeoffs Across Workflow, Scale, and Policy Sharing

Abstract

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy training tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy training, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy training, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.
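
As a rough illustration of what "policy routing" means for gradients, the toy sketch below sums hypothetical per-step gradient vectors either into one bucket per role (Isolated-Policy) or into a single bucket (Shared-Policy). It is not the paper's implementation: the function names, the two-dimensional gradient values, and the use of summed L2 norms as a stand-in for "gradient mass" are all assumptions made for illustration only.

```python
import math

def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def route_gradients(steps, shared):
    """Accumulate per-step gradients into policy buckets.

    steps:  list of (role, gradient) pairs; values are made up for this sketch.
    shared: True  -> every role updates one bucket (Shared-Policy);
            False -> each role updates its own bucket (Isolated-Policy).
    """
    buckets = {}
    for role, grad in steps:
        key = "shared" if shared else role
        buckets[key] = add(buckets[key], grad) if key in buckets else list(grad)
    return buckets

def gradient_mass_by_role(steps):
    """Fraction of total per-step gradient norm contributed by each role.

    Summed L2 norm is an assumed proxy for "gradient mass", not a
    definition taken from the paper.
    """
    mass = {}
    for role, grad in steps:
        mass[role] = mass.get(role, 0.0) + norm(grad)
    total = sum(mass.values())
    return {role: m / total for role, m in mass.items()}

# Hypothetical Orch-Workers rollout: one orchestrator step and three
# parallel workers that saw the same prompt and produced similar gradients.
steps = [
    ("orchestrator", [0.1, 0.0]),
    ("worker", [1.0, 0.5]),
    ("worker", [1.0, 0.5]),
    ("worker", [1.0, 0.5]),
]

isolated = route_gradients(steps, shared=False)  # one accumulated gradient per role
shared = route_gradients(steps, shared=True)     # one accumulated gradient overall
mass = gradient_mass_by_role(steps)              # share of gradient norm per role
```

In this toy setting the worker bucket under isolated routing is three times a single worker's gradient, which loosely mirrors the amplification from parallel same-role agents on shared prompts. Under shared routing the single bucket is dominated by the worker contribution, a simple picture of a shared policy being pulled toward the role with the most gradient mass.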