Deep Self-Evolving Reasoning
October 20, 2025
Authors: Zihan Liu, Shun Zheng, Xumeng Wen, Yang Wang, Jiang Bian, Mao Yang
cs.AI
Abstract
Long-form chain-of-thought reasoning has become a cornerstone of advanced
reasoning in large language models. While recent verification-refinement
frameworks have enabled proprietary models to solve Olympiad-level problems,
their effectiveness hinges on strong, reliable verification and correction
capabilities, which remain fragile in open-weight, smaller-scale models. This
work demonstrates that even with weak verification and refinement capabilities
on hard tasks, the reasoning limits of such models can be substantially
extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning
(DSER). We conceptualize iterative reasoning as a Markov chain, where each step
represents a stochastic transition in the solution space. The key insight is
that convergence to a correct solution is guaranteed as long as the probability
of improvement marginally exceeds that of degradation. By running multiple
long-horizon, self-evolving processes in parallel, DSER amplifies these small
positive tendencies, enabling the model to asymptotically approach correct
answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On
the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously
unsolvable problems and boosts overall performance, enabling this compact model
to surpass the single-turn accuracy of its 600B-parameter teacher through
majority voting. Beyond its immediate utility for test-time scaling, the DSER
framework serves to diagnose the fundamental limitations of current open-weight
reasoners. By clearly delineating their shortcomings in self-verification,
refinement, and stability, our findings establish a clear research agenda for
developing next-generation models with powerful, intrinsic self-evolving
capabilities.
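
The Markov-chain argument can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a simplified two-state chain (a solution is either correct or incorrect), and the function names and the probabilities `p_improve` and `p_degrade` are hypothetical values chosen for illustration. Under that assumption, the long-run probability of holding a correct solution is p_improve / (p_improve + p_degrade), which exceeds 1/2 exactly when improvement is more likely than degradation, and majority voting over parallel chains amplifies that edge.

```python
import random


def stationary_correct_prob(p_improve, p_degrade):
    """Long-run probability of the 'correct' state in a two-state chain."""
    return p_improve / (p_improve + p_degrade)


def run_chain(p_improve, p_degrade, steps, rng):
    """Simulate one self-evolving process; True if it ends on a correct solution."""
    correct = False  # assumption: the initial attempt is wrong
    for _ in range(steps):
        if correct:
            # A correct solution may be degraded by a faulty refinement step.
            if rng.random() < p_degrade:
                correct = False
        else:
            # An incorrect solution may be repaired by a refinement step.
            if rng.random() < p_improve:
                correct = True
    return correct


def majority_vote_correct(p_improve, p_degrade, steps, n_chains, rng):
    """Run n_chains independent processes and take a strict majority vote.

    Pessimistic simplification: all incorrect chains are treated as agreeing
    on the same wrong answer.
    """
    votes = sum(run_chain(p_improve, p_degrade, steps, rng) for _ in range(n_chains))
    return votes > n_chains / 2


def estimate_success_rate(p_improve, p_degrade, steps, n_chains, trials, seed=0):
    """Monte Carlo estimate of how often the majority vote is correct."""
    rng = random.Random(seed)
    wins = sum(
        majority_vote_correct(p_improve, p_degrade, steps, n_chains, rng)
        for _ in range(trials)
    )
    return wins / trials


if __name__ == "__main__":
    # Hypothetical per-step probabilities: improvement only slightly likelier.
    print(stationary_correct_prob(0.06, 0.04))
    print(estimate_success_rate(0.06, 0.04, steps=200, n_chains=1, trials=200))
    print(estimate_success_rate(0.06, 0.04, steps=200, n_chains=31, trials=200))
```

With these illustrative values the stationary probability of a correct solution is 0.06 / (0.06 + 0.04) = 0.6, so a single long chain is right only somewhat more often than not, while the majority over many parallel chains is right considerably more often. Real refinement steps depend on the content of the solution, so the two-state chain should be read as an intuition aid rather than a model of DSER itself.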