When Does Multi-Agent RL Improve LLM Workflows? Tradeoffs Across Workflow, Scale, and Policy Sharing

Abstract

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but the gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy training tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing. Under Isolated-Policy training, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows. Under Shared-Policy training, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing failure signatures that differ by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.
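The routing difference between the two training modes can be illustrated with a toy count. The sketch below is not the paper's method: it assumes a hypothetical Voting-style layout of three parallel voters and one aggregator, and it uses the number of role outputs per prompt as a crude stand-in for gradient mass. The names `VOTING_ROLES`, `samples_per_policy`, and `role_share` are invented for illustration.

```python
from collections import Counter

# Hypothetical role layout for a Voting-style workflow: several parallel
# voters followed by one aggregator. The counts are illustrative only.
VOTING_ROLES = ["voter", "voter", "voter", "aggregator"]


def samples_per_policy(roles, shared):
    """Count how many role outputs feed each policy's update for one prompt."""
    if shared:
        # Shared-Policy: every role output updates the same parameters.
        return Counter({"shared": len(roles)})
    # Isolated-Policy: each role updates its own parameters, so parallel
    # same-role agents stack their outputs onto one role's policy.
    return Counter(roles)


def role_share(roles):
    """Fraction of one prompt's role outputs contributed by each role."""
    total = len(roles)
    return {role: n / total for role, n in Counter(roles).items()}


isolated = samples_per_policy(VOTING_ROLES, shared=False)
shared = samples_per_policy(VOTING_ROLES, shared=True)
shares = role_share(VOTING_ROLES)

print("isolated:", dict(isolated))
print("shared:", dict(shared))
print("role share under a shared policy:", shares)
```

Under these assumptions, the isolated voter policy receives several same-prompt outputs per update while the aggregator receives one, and under a shared policy the voter role contributes most of the outputs that update the single set of parameters. Actual gradient magnitudes depend on sequence lengths, rewards, and the RL objective, none of which this count models.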