arXiv: 2510.23595v1
Multi-Agent Evolve: LLM Self-Improve through Co-evolution
October 27, 2025
Authors: Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhan, Mostofa Patwary, Jiaxuan You
cs.AI
Abstract
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs relies heavily on human-curated datasets and verifiable rewards, which limits its scalability and generality. Recent Self-Play RL methods, inspired by the success of the paradigm in games such as Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, these methods depend primarily on a grounded environment for feedback (e.g., a Python interpreter or a game engine), and extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. The core design of MAE is a triplet of interacting agents (Proposer, Solver, Judge) instantiated from a single LLM, with reinforcement learning applied to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.
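The Proposer, Solver, and Judge interaction described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the role prompts, the `co_evolution_step` and `parse_score` helpers, the 0-to-1 score range, and the `stub_model` stand-in are all hypothetical, and the RL update that would consume the rewards is omitted because the abstract does not specify it.

```python
from typing import Callable, Dict

# Hypothetical role prompts; the abstract does not give the actual prompts.
ROLE_PROMPTS = {
    "proposer": "Propose one question.",
    "solver": "Answer the question: {question}",
    "judge": "Score from 0 to 1: {text}",
}


def parse_score(raw: str) -> float:
    """Parse a judge reply into a score clamped to [0, 1]; unparsable replies get 0."""
    try:
        value = float(raw.strip())
    except ValueError:
        return 0.0
    return min(1.0, max(0.0, value))


def co_evolution_step(model: Callable[[str], str]) -> Dict[str, object]:
    """Run one Proposer -> Solver -> Judge round using a single shared model."""
    # Proposer role: the shared model generates a question.
    question = model(ROLE_PROMPTS["proposer"])
    # Solver role: the same model attempts an answer.
    answer = model(ROLE_PROMPTS["solver"].format(question=question))
    # Judge role: the same model scores the question and the answer.
    question_reward = parse_score(
        model(ROLE_PROMPTS["judge"].format(text=question))
    )
    answer_reward = parse_score(
        model(ROLE_PROMPTS["judge"].format(text=f"Q: {question}\nA: {answer}"))
    )
    # In a real system these rewards would feed an RL update (not shown).
    return {
        "question": question,
        "answer": answer,
        "question_reward": question_reward,
        "answer_reward": answer_reward,
    }


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs without a model (assumption)."""
    if prompt.startswith("Score"):
        return "0.5"
    if prompt.startswith("Answer"):
        return "4"
    return "What is 2 + 2?"


result = co_evolution_step(stub_model)
print(result)
```

Passing one callable for all three roles mirrors the abstract's statement that the agents are instantiated from a single LLM; replacing `stub_model` with a real model call would leave the loop unchanged.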