Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Abstract

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.
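To make the notion of an orchestration trace concrete, the following is a minimal Python sketch of a temporal interaction graph built from the seven event types listed above. It is an illustration only and is not the JSON schema released in the artifact: the field names (`event_id`, `t`, `actor`, `target`, `parents`) and the helpers `to_json`, `from_json`, `replay`, and `example_trace` are hypothetical choices made for this example.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import List, Optional


class EventKind(str, Enum):
    # The seven event types named in the abstract.
    SPAWN = "spawn"
    DELEGATE = "delegate"
    COMMUNICATE = "communicate"
    TOOL_USE = "tool_use"
    RETURN = "return"
    AGGREGATE = "aggregate"
    STOP = "stop"


@dataclass
class TraceEvent:
    # All field names are assumptions for illustration, not the released schema.
    event_id: int                 # unique id within one trace
    t: int                        # logical timestamp
    kind: EventKind
    actor: str                    # agent that performs the event
    target: Optional[str] = None  # receiving agent or tool, if any
    parents: List[int] = field(default_factory=list)  # ids of events this one depends on


def to_json(events: List[TraceEvent]) -> str:
    """Serialize a trace; EventKind subclasses str, so it encodes as a plain string."""
    return json.dumps([asdict(e) for e in events])


def from_json(text: str) -> List[TraceEvent]:
    """Rebuild TraceEvent objects, restoring the EventKind enum from its string value."""
    return [TraceEvent(**{**d, "kind": EventKind(d["kind"])}) for d in json.loads(text)]


def replay(events: List[TraceEvent]) -> List[TraceEvent]:
    """Return events in temporal order, checking that every parent precedes its child."""
    seen = set()
    ordered = sorted(events, key=lambda e: (e.t, e.event_id))
    for e in ordered:
        missing = [p for p in e.parents if p not in seen]
        if missing:
            raise ValueError(f"event {e.event_id} replayed before parents {missing}")
        seen.add(e.event_id)
    return ordered


def example_trace() -> List[TraceEvent]:
    """A toy trace: one orchestrator spawns one worker, which calls a tool and returns."""
    k = EventKind
    return [
        TraceEvent(0, 0, k.SPAWN, "orchestrator", "worker_a"),
        TraceEvent(1, 1, k.DELEGATE, "orchestrator", "worker_a", [0]),
        TraceEvent(2, 2, k.TOOL_USE, "worker_a", "search", [1]),
        TraceEvent(3, 3, k.RETURN, "worker_a", "orchestrator", [2]),
        TraceEvent(4, 4, k.AGGREGATE, "orchestrator", None, [3]),
        TraceEvent(5, 5, k.STOP, "orchestrator", None, [4]),
    ]
```

Storing explicit `parents` edges alongside timestamps is one way to keep the graph structure recoverable after serialization, so that a reward or credit signal could in principle be attached to an individual event or message rather than only to the final outcome.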
