OpenSIR: Open-Ended Self-Improving Reasoner
November 1, 2025
Authors: Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini
cs.AI
Abstract
Recent advances in large language model (LLM) reasoning through reinforcement
learning rely on annotated datasets for verifiable rewards, which may limit
models' ability to surpass human-level performance. While self-play offers a
promising alternative, existing approaches depend on external verifiers or
cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner
(OpenSIR), a self-play framework where an LLM learns to generate and solve
novel problems by alternating teacher and student roles without external
supervision. To generate novel problems, OpenSIR optimises for both difficulty
and diversity, rewarding problems that challenge appropriately while exploring
distinct concepts, enabling open-ended mathematical discovery. Starting from a
single trivial seed problem, OpenSIR substantially improves instruction models:
Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to
34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on
GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through
co-evolving teacher-student roles that adaptively calibrate difficulty and
drive diverse exploration, progressing autonomously from basic to advanced
mathematics.
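The abstract describes a teacher reward that balances two signals: problems should be appropriately difficult for the current student (neither trivial nor unsolvable) and diverse (distinct from previously generated problems). A minimal sketch of such a reward is below; the function names, the 0.5 target solve rate, the Euclidean embedding distance, and the equal weighting are all illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a difficulty + diversity teacher reward in the
# spirit of OpenSIR's description; all specifics are assumptions.

def difficulty_reward(student_solve_rate: float, target: float = 0.5) -> float:
    """Peak reward when the student solves the problem at the target rate
    (here, about half the time); linearly lower toward 0.0 and 1.0."""
    return 1.0 - abs(student_solve_rate - target) / max(target, 1.0 - target)

def diversity_reward(new_embedding, past_embeddings) -> float:
    """Reward proportional to distance from the nearest past problem,
    clipped to 1.0 so diversity cannot dominate the total reward."""
    if not past_embeddings:
        return 1.0  # first problem is maximally novel by definition
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = min(dist(new_embedding, e) for e in past_embeddings)
    return min(nearest, 1.0)

def teacher_reward(solve_rate, embedding, past_embeddings,
                   w_difficulty: float = 0.5, w_diversity: float = 0.5) -> float:
    """Weighted sum of the two signals; weights are an assumption."""
    return (w_difficulty * difficulty_reward(solve_rate)
            + w_diversity * diversity_reward(embedding, past_embeddings))
```

Under this sketch, a problem the student solves about half the time and that lies far from earlier problems in embedding space earns the maximum reward, which matches the abstract's claim that the teacher adaptively calibrates difficulty while driving exploration of distinct concepts.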