
OpenSIR:开放式自我改进推理系统

OpenSIR: Open-Ended Self-Improving Reasoner

November 1, 2025
作者: Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini
cs.AI

Abstract

Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.
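The abstract describes a self-play loop in which one model alternates between a teacher role (generating problems) and a student role (solving them), with the teacher rewarded for problems of calibrated difficulty that also explore new concepts. The toy sketch below illustrates that reward structure with a simulated student rather than an LLM; the specific reward shapes (targeting an intermediate solve rate for difficulty, novelty of concept for diversity), the concept names, and the skill-update rule are illustrative assumptions, not the paper's actual formulas.

```python
# Toy simulation of an OpenSIR-style self-play round.
# Assumed reward shapes: difficulty peaks at a ~50% student solve rate;
# diversity rewards concepts not yet seen. Neither is the paper's exact design.
import random

def difficulty_reward(solve_rate, target=0.5):
    """Highest (1.0) when the student solves the problem about half the time."""
    return 1.0 - abs(solve_rate - target) * 2.0

def diversity_reward(concept, seen_concepts):
    """Reward proposing a concept the teacher has not used before."""
    return 1.0 if concept not in seen_concepts else 0.0

def self_play_round(teacher_policy, student_skill, seen_concepts, rng):
    # Teacher role: propose a problem (here, just a concept and difficulty level).
    concept, level = teacher_policy(rng)
    # Student role: attempt the problem several times; estimate the solve rate.
    p_solve = student_skill.get(concept, 0.2) / level
    attempts = [rng.random() < p_solve for _ in range(8)]
    solve_rate = sum(attempts) / len(attempts)
    # Teacher reward combines difficulty calibration and concept diversity.
    reward = difficulty_reward(solve_rate) + diversity_reward(concept, seen_concepts)
    seen_concepts.add(concept)
    # Student improves slightly on concepts it practised (a stand-in for RL updates).
    student_skill[concept] = min(1.0, student_skill.get(concept, 0.2) + 0.05 * solve_rate)
    return reward, solve_rate

rng = random.Random(0)
teacher = lambda r: (r.choice(["algebra", "geometry", "number_theory"]),
                     r.choice([1, 2, 3]))
skill, seen = {}, set()
rewards = [self_play_round(teacher, skill, seen, rng)[0] for _ in range(50)]
```

With these assumed reward shapes, the teacher is pushed toward problems the student solves about half the time while cycling through fresh concepts, which is the difficulty/diversity trade-off the abstract attributes to OpenSIR's open-ended exploration.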