Guided Self-Evolving LLMs with Minimal Human Supervision

December 2, 2025
Authors: Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu
cs.AI

Abstract

AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
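To make the described loop concrete, the sketch below outlines one guided self-play iteration as the abstract describes it: in-context grounding of the Challenger on a few human-labeled seed examples, an online difficulty-based filter on the synthetic questions, and mixed Solver training on human plus synthetic data. This is a minimal illustrative sketch, not the paper's implementation; all class and method names (Challenger.generate, Solver.success_rate, the 0.2-0.8 difficulty band, and so on) are assumptions introduced here for clarity.

```python
import random

class Challenger:
    """Generates synthetic questions, grounded on a few human-labeled seed examples."""

    def generate(self, seeds, n):
        # Placeholder: a real Challenger would prompt an LLM with `seeds` in context.
        return [f"synthetic question #{i} derived from {random.choice(seeds)}" for i in range(n)]

    def update(self, feedback):
        # Placeholder for the Challenger's update from Solver feedback.
        pass

class Solver:
    """Answers questions; exposes a per-question success-rate estimate."""

    def success_rate(self, question):
        # Placeholder: normally estimated by sampling several solutions per question.
        return random.random()

    def update(self, examples):
        # Placeholder for mixed training on human + synthetic examples.
        pass

def r_few_iteration(challenger, solver, human_examples, n_seeds=8, n_synth=64):
    # 1. In-context grounding: sample a small set of human-labeled examples as seeds.
    seeds = random.sample(human_examples, n_seeds)
    synthetic = challenger.generate(seeds, n_synth)

    # 2. Online difficulty curriculum: keep questions near the Solver's ability frontier
    #    (the 0.2-0.8 band is an assumed choice, not taken from the paper).
    curated = [q for q in synthetic if 0.2 <= solver.success_rate(q) <= 0.8]

    # 3. Mixed training: the Solver jointly trains on human and curated synthetic data.
    solver.update(curated + seeds)

    # 4. The Challenger is updated using Solver feedback on its generated questions.
    challenger.update([(q, solver.success_rate(q)) for q in curated])
    return challenger, solver

# Toy usage with placeholder data.
human_pool = [f"human-labeled problem {i}" for i in range(32)]
challenger, solver = Challenger(), Solver()
for _ in range(3):  # a few co-evolution iterations
    challenger, solver = r_few_iteration(challenger, solver, human_pool)
```

The filtering step is where the "online, difficulty-based curriculum" would live: questions the Solver already answers reliably (or cannot answer at all) are discarded, so training concentrates on the frontier of the Solver's ability.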