
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

May 4, 2026
Author: Chenchen Zhang
cs.AI

Abstract

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units, from token to team; explicit counterfactual message-level credit assignment remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The scale gap we identify is between publicly reported deployment envelopes and open academic evaluation regimes; it is not an independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.
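To make the trace format concrete, below is a minimal Python sketch of a replayable orchestration trace built from the seven event types the abstract names. The field names (t, event, src, dst, payload) and the example task are illustrative assumptions for exposition, not the released JSON schema, which lives in the repository linked above.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Optional

# The seven event types named in the abstract's definition of an
# orchestration trace. Names are illustrative, not the released schema.
EVENT_TYPES = {"spawn", "delegate", "communicate", "tool_use",
               "return", "aggregate", "stop"}

@dataclass
class TraceEvent:
    t: int                      # logical timestamp ordering the temporal graph
    event: str                  # one of EVENT_TYPES
    src: str                    # agent emitting the event (graph node)
    dst: Optional[str] = None   # receiving agent, if any (graph edge)
    payload: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.event not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event}")

# A tiny replayable trace: the orchestrator spawns one sub-agent,
# delegates a subtask, receives the result, aggregates, and stops.
# The task content is a made-up placeholder.
trace = [
    TraceEvent(0, "spawn",     "orchestrator", "worker_1"),
    TraceEvent(1, "delegate",  "orchestrator", "worker_1",
               {"task": "summarize source A"}),
    TraceEvent(2, "tool_use",  "worker_1", None, {"tool": "search"}),
    TraceEvent(3, "return",    "worker_1", "orchestrator",
               {"result": "summary text"}),
    TraceEvent(4, "aggregate", "orchestrator"),
    TraceEvent(5, "stop",      "orchestrator"),
]

# Serializing to JSON makes the trace replayable and diffable; reward or
# credit signals could then attach to individual events (e.g., message-level
# credit on "communicate" events) or to the whole team trajectory.
print(json.dumps([asdict(e) for e in trace], indent=2))
```

Representing each of the five orchestration sub-decisions (spawn, delegate, communicate, aggregate, stop) as an explicit timestamped event is what lets RL signals be attached anywhere from a single token-producing event up to the full team trace.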