Guided Self-Evolving LLMs with Minimal Human Supervision
December 2, 2025
Authors: Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu
cs.AI
Abstract
AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
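To make the described loop concrete, the following is a minimal, illustrative Python sketch of one R-Few-style iteration, based only on the abstract's description. All interfaces here are assumptions: the `challenger`/`solver` objects, their `generate`, `answer`, and `train` methods, the grounding size, sampling counts, and the difficulty band are hypothetical placeholders, not the paper's actual API or hyperparameters.

```python
import random

def r_few_iteration(challenger, solver, human_examples, n_synthetic=64,
                    n_grounding=4, target_difficulty=(0.25, 0.75)):
    """One hypothetical iteration of a guided Challenger-Solver loop.

    Assumptions (not from the paper): `challenger.generate(prompt)` returns a
    dict {"question": ..., "answer": ...}; `solver.answer(q)` returns a string;
    `solver.train(batch)` performs a training update on a list of such dicts.
    """
    # 1. In-context grounding: the Challenger conditions question generation
    #    on a few human-labeled examples, which is meant to limit concept drift.
    synthetic = []
    for _ in range(n_synthetic):
        demos = random.sample(human_examples, k=n_grounding)
        prompt = "\n\n".join(f"Q: {d['question']}\nA: {d['answer']}" for d in demos)
        synthetic.append(challenger.generate(prompt))

    # 2. Online difficulty estimate: score each synthetic question by the
    #    Solver's empirical success rate and keep questions inside a target
    #    band (a stand-in for the paper's difficulty-based curriculum).
    curriculum = []
    for item in synthetic:
        successes = sum(solver.answer(item["question"]) == item["answer"]
                        for _ in range(8))
        rate = successes / 8
        if target_difficulty[0] <= rate <= target_difficulty[1]:
            curriculum.append(item)

    # 3. Mixed training: the Solver updates jointly on human and filtered
    #    synthetic examples, so human data keeps anchoring the distribution.
    solver.train(human_examples + curriculum)
    return curriculum
```

This sketch only mirrors the three ingredients named in the abstract (grounded generation, an online difficulty filter, and mixed human/synthetic training); the paper's actual reward design, filtering rule, and update procedure may differ.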