Deep Self-Evolving Reasoning

Abstract

Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in smaller, open-weight models. This work demonstrates that even when such models have only weak verification and refinement capabilities on hard tasks, their reasoning limits can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 of 9 previously unsolvable problems and boosts overall performance, enabling this compact model, with majority voting, to surpass the single-turn accuracy of its 600B-parameter teacher. Beyond its immediate utility for test-time scaling, the DSER framework also serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
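The Markov-chain view described in the abstract can be illustrated with a small numerical sketch. The abstract does not specify the state space, so the code below assumes the simplest possible one: two states, "incorrect" and "correct", with a hypothetical per-step improvement probability `p_improve` (incorrect to correct) and degradation probability `p_degrade` (correct to incorrect). Under that assumption a single chain settles at a stationary probability of `p_improve / (p_improve + p_degrade)` of being correct, which exceeds 0.5 exactly when improvement is more likely than degradation, and a majority vote over independent parallel chains amplifies that margin. All function names and parameter values here are illustrative and are not taken from the paper.

```python
from math import comb


def prob_correct_after(steps, p_improve, p_degrade, start=0.0):
    """Probability that one chain is in the 'correct' state after `steps` transitions.

    Assumes a two-state chain (an illustrative simplification):
    incorrect -> correct with probability p_improve,
    correct -> incorrect with probability p_degrade.
    `start` is the initial probability of being correct.
    """
    p = start
    for _ in range(steps):
        # Stay correct, or move from incorrect to correct.
        p = p * (1.0 - p_degrade) + (1.0 - p) * p_improve
    return p


def stationary_correct(p_improve, p_degrade):
    """Long-run probability of the 'correct' state for the two-state chain."""
    return p_improve / (p_improve + p_degrade)


def majority_vote_correct(p, n_chains):
    """Probability that a strict majority of n independent chains are correct.

    Treats all incorrect chains as voting for one shared wrong answer
    (an assumption; if wrong answers are spread out, voting does better).
    """
    need = n_chains // 2 + 1
    return sum(
        comb(n_chains, k) * p**k * (1.0 - p) ** (n_chains - k)
        for k in range(need, n_chains + 1)
    )


if __name__ == "__main__":
    # Hypothetical values: improvement only slightly more likely than degradation.
    p_imp, p_deg = 0.12, 0.08
    for steps in (1, 5, 20, 80):
        print(steps, round(prob_correct_after(steps, p_imp, p_deg), 4))
    p_inf = stationary_correct(p_imp, p_deg)
    for n in (1, 11, 101):
        print(n, round(majority_vote_correct(p_inf, n), 4))
```

In this toy setting a single long chain is correct only somewhat more often than not, so the gain comes from combining long horizons (to reach the stationary regime) with parallel chains (to turn a small margin into a reliable vote).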