Multi-Agent Evolve: LLM Self-Improve through Co-evolution

Abstract

Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs relies heavily on human-curated datasets and verifiable rewards, which limits its scalability and generality. Recent self-play RL methods, inspired by the success of this paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, these methods depend primarily on a grounded environment for feedback (e.g., a Python interpreter or a game engine), and extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general-knowledge Q&A. The core design of MAE is a triplet of interacting agents (Proposer, Solver, Judge) instantiated from a single LLM, with reinforcement learning applied to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% across multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.
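
The Proposer, Solver, Judge interaction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `LLM` callable, the prompt strings, the `Episode` record, the `parse_score` helper, and the 0-to-1 scoring scale are all hypothetical names and choices introduced for illustration, and the reinforcement learning update that would consume the Judge's scores is omitted because the abstract does not specify it.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a language model: maps a prompt to a reply.
LLM = Callable[[str], str]


@dataclass
class Episode:
    question: str
    answer: str
    question_score: float  # Judge's score for the Proposer's question
    answer_score: float    # Judge's score for the Solver's answer


def parse_score(text: str) -> float:
    """Parse a Judge reply into a score clamped to [0, 1]; unparsable replies score 0."""
    try:
        return min(1.0, max(0.0, float(text.strip())))
    except ValueError:
        return 0.0


def run_round(llm: LLM, topic: str) -> Episode:
    """One propose/solve/judge round, with all three roles played by the same model."""
    # Role 1 (Proposer): generate a question.
    question = llm(f"[Proposer] Write one question about {topic}.")
    # Role 2 (Solver): attempt an answer.
    answer = llm(f"[Solver] Answer the question: {question}")
    # Role 3 (Judge): evaluate both the question and the answer.
    q_score = parse_score(llm(f"[Judge] Rate this question from 0 to 1: {question}"))
    a_score = parse_score(
        llm(f"[Judge] Rate this answer from 0 to 1.\nQ: {question}\nA: {answer}")
    )
    # In a full system these scores would feed an RL update (assumed, not shown).
    return Episode(question, answer, q_score, a_score)


def toy_llm(prompt: str) -> str:
    """Deterministic toy model so the sketch runs without any real LLM."""
    if prompt.startswith("[Proposer]"):
        return "What is 2 + 3?"
    if prompt.startswith("[Solver]"):
        return "5"
    return "0.8"  # any Judge prompt


episode = run_round(toy_llm, "arithmetic")
```

The point of the sketch is only the control flow: one shared model is prompted in three roles, and the Judge's outputs serve as the feedback signal in place of a grounded environment such as an interpreter or game engine.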
