Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
May 4, 2026
作者: Chenchen Zhang
cs.AI
Abstract
As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions.
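To make the trace abstraction concrete, here is a minimal sketch of an orchestration trace as a time-ordered event list covering the seven event types named above. The event names, field names, and replayability check are illustrative assumptions, not the paper's actual schema.

```python
# Toy orchestration trace: a time-ordered list of events.
# Event types mirror the seven kinds listed in the abstract;
# field names ("t", "actor", "target", ...) are invented for illustration.
SPAWN, DELEGATE, COMMUNICATE, TOOL_USE, RETURN, AGGREGATE, STOP = (
    "spawn", "delegate", "communicate", "tool_use", "return", "aggregate", "stop",
)

trace = [
    {"t": 0, "type": SPAWN,     "actor": "orchestrator", "target": "worker_1"},
    {"t": 1, "type": DELEGATE,  "actor": "orchestrator", "target": "worker_1", "task": "summarize"},
    {"t": 2, "type": TOOL_USE,  "actor": "worker_1",     "tool": "search"},
    {"t": 3, "type": RETURN,    "actor": "worker_1",     "target": "orchestrator"},
    {"t": 4, "type": AGGREGATE, "actor": "orchestrator"},
    {"t": 5, "type": STOP,      "actor": "orchestrator"},
]

def is_replayable(trace):
    """A trace is replayable here if its timestamps are non-decreasing
    and it ends with an explicit stopping decision."""
    times = [e["t"] for e in trace]
    return times == sorted(times) and trace[-1]["type"] == STOP

print(is_replayable(trace))  # True
```

A reversed or truncated event list fails the check, which is the sense in which a trace (rather than a bag of events) is the unit of analysis.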
Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision.
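To illustrate what counterfactual message-level credit means, a toy sketch follows: the credit of a message is the team reward with the message minus the team reward with it ablated. The reward function and messages here are invented purely for illustration.

```python
# Toy counterfactual credit for message i in an episode:
#   credit(i) = R(all messages) - R(all messages except i).
# The reward function below is invented for illustration: team reward
# is the number of distinct facts the messages deliver.

def reward(messages):
    return len({fact for m in messages for fact in m["facts"]})

def counterfactual_credit(messages, i):
    without = messages[:i] + messages[i + 1:]
    return reward(messages) - reward(without)

messages = [
    {"sender": "worker_1", "facts": {"a", "b"}},
    {"sender": "worker_2", "facts": {"b"}},   # fully redundant message
    {"sender": "worker_3", "facts": {"c"}},
]

print([counterfactual_credit(messages, i) for i in range(3)])  # [1, 0, 1]
```

The redundant message receives zero credit, which is exactly the discrimination that token- or trajectory-level signals cannot make.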
We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The scale gap we observe is a mismatch between publicly reported deployment envelopes and open academic evaluation regimes, not an independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.