CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

Abstract

Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
