Multi-Agent Evolve: LLM Self-Improve through Co-evolution
October 27, 2025
Authors: Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhan, Mostofa Patwary, Jiaxuan You
cs.AI
Abstract
Reinforcement Learning (RL) has demonstrated significant potential in
enhancing the reasoning capabilities of large language models (LLMs). However,
the success of RL for LLMs heavily relies on human-curated datasets and
verifiable rewards, which limits the scalability and generality of the
approach. Recent self-play RL methods, inspired by the success of this
paradigm in games such as Go, aim to enhance LLM reasoning capabilities
without human-annotated data. However, these methods primarily depend on a
grounded environment for feedback (e.g., a Python interpreter or a game
engine), and extending them to general domains remains challenging. To
address these challenges, we propose
Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in
solving diverse tasks, including mathematics, reasoning, and general knowledge
Q&A. At its core, MAE instantiates a triplet of interacting agents
(Proposer, Solver, Judge) from a single LLM and applies reinforcement
learning to optimize their behaviors: the Proposer generates questions, the
Solver attempts solutions, and the Judge evaluates both as the three agents
co-evolve. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves
an average improvement of 4.54% on multiple benchmarks. These results highlight
MAE as a scalable, data-efficient method for enhancing the general reasoning
abilities of LLMs with minimal reliance on human-curated supervision.
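For illustration, the following minimal sketch shows what one MAE-style interaction round could look like, assuming a generic single-LLM completion call. The helper names (llm_generate, mae_round) and the prompts are hypothetical, and the Judge's scoring and the RL update are placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of one MAE-style co-evolution round.
# All three "agents" are role prompts over one shared LLM, as the
# abstract describes; everything else here is an assumption.

from dataclasses import dataclass


@dataclass
class Interaction:
    question: str
    answer: str
    question_score: float  # Judge's rating of question quality (0-1)
    answer_score: float    # Judge's rating of answer correctness (0-1)


def llm_generate(prompt: str) -> str:
    """Placeholder for a completion call to the shared LLM
    (e.g., Qwen2.5-3B-Instruct)."""
    raise NotImplementedError


def mae_round(domain: str) -> Interaction:
    # Proposer: generate a question in the target domain.
    question = llm_generate(f"Propose a challenging {domain} question.")
    # Solver: attempt a solution to the proposed question.
    answer = llm_generate(f"Solve step by step: {question}")
    # Judge: score both the question and the answer.
    q_score = float(llm_generate(
        f"Rate from 0 to 1 the quality of this question: {question}"))
    a_score = float(llm_generate(
        f"Rate from 0 to 1 this answer to '{question}': {answer}"))
    return Interaction(question, answer, q_score, a_score)


# Rollouts of such interactions would then drive a policy-gradient-style
# RL update of the single underlying model, so that Proposer, Solver,
# and Judge co-evolve together.
```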