Deep Self-Evolving Reasoning
October 20, 2025
Authors: Zihan Liu, Shun Zheng, Xumeng Wen, Yang Wang, Jiang Bian, Mao Yang
cs.AI
Abstract
Long-form chain-of-thought reasoning has become a cornerstone of advanced
reasoning in large language models. While recent verification-refinement
frameworks have enabled proprietary models to solve Olympiad-level problems,
their effectiveness hinges on strong, reliable verification and correction
capabilities, which remain fragile in open-weight, smaller-scale models. This
work demonstrates that even with weak verification and refinement capabilities
on hard tasks, the reasoning limits of such models can be substantially
extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning
(DSER). We conceptualize iterative reasoning as a Markov chain, where each step
represents a stochastic transition in the solution space. The key insight is
that convergence to a correct solution is guaranteed as long as the probability
of improvement marginally exceeds that of degradation. By running multiple
long-horizon, self-evolving processes in parallel, DSER amplifies these small
positive tendencies, enabling the model to asymptotically approach correct
answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On
the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously
unsolvable problems and boosts overall performance, enabling this compact model
to surpass the single-turn accuracy of its 600B-parameter teacher through
majority voting. Beyond its immediate utility for test-time scaling, the DSER
framework serves to diagnose the fundamental limitations of current open-weight
reasoners. By clearly delineating their shortcomings in self-verification,
refinement, and stability, our findings establish a clear research agenda for
developing next-generation models with powerful, intrinsic self-evolving
capabilities.
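
The convergence argument above can be illustrated with a toy simulation. The sketch below is not the authors' implementation: it assumes a simplified two-state Markov chain (a solution is either incorrect or correct), with hypothetical parameters p_improve (chance an incorrect solution becomes correct in one iteration) and p_degrade (chance a correct solution becomes incorrect). Under that assumption, the long-run probability of being correct is p_improve / (p_improve + p_degrade), which exceeds 1/2 whenever p_improve > p_degrade, so a majority vote over many independent long chains tends to pick the correct answer.

```python
import random


def stationary_correct_prob(p_improve: float, p_degrade: float) -> float:
    """Long-run probability of the 'correct' state in the two-state chain."""
    return p_improve / (p_improve + p_degrade)


def run_chain(p_improve: float, p_degrade: float, steps: int,
              rng: random.Random) -> bool:
    """Simulate one self-evolving process; return True if it ends correct."""
    correct = False  # assumption: each chain starts from an incorrect solution
    for _ in range(steps):
        if correct:
            # A refinement step may break a correct solution.
            if rng.random() < p_degrade:
                correct = False
        else:
            # A refinement step may repair an incorrect solution.
            if rng.random() < p_improve:
                correct = True
    return correct


def majority_vote(p_improve: float, p_degrade: float, steps: int,
                  n_chains: int, seed: int = 0) -> tuple[bool, float]:
    """Run n_chains independent chains and vote on their final states.

    Returns (majority_is_correct, fraction_of_chains_ending_correct).
    """
    rng = random.Random(seed)
    votes = sum(run_chain(p_improve, p_degrade, steps, rng)
                for _ in range(n_chains))
    return votes * 2 > n_chains, votes / n_chains


if __name__ == "__main__":
    # Hypothetical values: improvement only marginally beats degradation.
    p_up, p_down = 0.12, 0.10
    print("stationary P(correct):", stationary_correct_prob(p_up, p_down))
    print("majority vote:", majority_vote(p_up, p_down, steps=200, n_chains=2001))
```

In this toy model, a single chain ends correct only slightly more often than not, while the vote across many chains is far more reliable. This mirrors the abstract's claim that parallel long-horizon processes amplify small positive tendencies. Real solution spaces have many more states, and the transition probabilities are not known in advance.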