When Does Multi-Agent Reinforcement Learning Improve LLM Workflows? Tradeoffs of Workflow, Scale, and Policy Sharing

Abstract

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.
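The routing distinction described above can be illustrated with a toy count of how many per-step gradient contributions land on each parameter set. This is only a sketch under stated assumptions, not the paper's training code: the role names (`voter`, `aggregator`), the rollout shape, and the helpers `route_gradient_mass` and `role_share` are hypothetical, and each agent step is assumed to contribute one unit of gradient mass.

```python
from collections import Counter


def route_gradient_mass(steps, shared):
    """Count per-step gradient contributions landing on each parameter set.

    steps:  list of role names, one entry per agent step in a rollout.
    shared: True  -> every role updates a single shared policy.
            False -> each role updates its own policy (isolated).
    Assumption: every step contributes exactly one unit of gradient mass.
    """
    mass = Counter()
    for role in steps:
        target = "shared_policy" if shared else f"{role}_policy"
        mass[target] += 1
    return dict(mass)


def role_share(steps):
    """Fraction of all step-level contributions that come from each role."""
    counts = Counter(steps)
    total = len(steps)
    return {role: n / total for role, n in counts.items()}


# Hypothetical Voting-style rollout: three parallel voters, one aggregator.
voting_steps = ["voter", "voter", "voter", "aggregator"]

# Isolated routing: the voter parameters receive three contributions per
# rollout, the aggregator parameters only one.
isolated = route_gradient_mass(voting_steps, shared=False)

# Shared routing: one parameter set receives all four contributions,
# three quarters of which come from the voter role.
shared = route_gradient_mass(voting_steps, shared=True)
shares = role_share(voting_steps)

print(isolated)
print(shared)
print(shares)
```

Under these assumptions, the isolated case shows parallel same-role agents stacking contributions on one role's parameters, while the shared case shows a single policy whose updates are dominated by the most frequent role. Real gradient mass also depends on token counts, advantages, and normalization, which this count ignores.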