Deep Self-Evolving Reasoning

Abstract

Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework also serves to diagnose the fundamental limitations of current open-weight reasoners. By delineating their shortcomings in self-verification, refinement, and stability, our findings set out a research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
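
The Markov-chain argument can be illustrated with a small simulation. The sketch below is not the authors' implementation. It assumes a simplified two-state chain (a solution is either correct or incorrect), and the names `p_improve`, `p_degrade`, `run_chain`, and `majority_vote_accuracy` are hypothetical, introduced only for illustration. Under that assumption, the long-run probability of holding a correct solution is p_improve / (p_improve + p_degrade), which exceeds 1/2 exactly when improvement is more likely than degradation, and majority voting over parallel chains amplifies that margin.

```python
import random


def stationary_correct_prob(p_improve: float, p_degrade: float) -> float:
    """Long-run probability of the 'correct' state in a two-state chain.

    Assumption: p_improve is P(incorrect -> correct) per step and
    p_degrade is P(correct -> incorrect) per step.
    """
    return p_improve / (p_improve + p_degrade)


def run_chain(p_improve: float, p_degrade: float, steps: int,
              rng: random.Random) -> bool:
    """Simulate one self-evolving process; return True if it ends correct."""
    correct = False  # assume the initial attempt is wrong (a hard problem)
    for _ in range(steps):
        if correct:
            if rng.random() < p_degrade:
                correct = False
        elif rng.random() < p_improve:
            correct = True
    return correct


def majority_vote_accuracy(p_improve: float, p_degrade: float, steps: int,
                           n_chains: int, n_trials: int, seed: int = 0) -> float:
    """Estimate how often a strict majority of parallel chains ends correct.

    Pessimistic assumption: all incorrect chains agree on the same wrong
    answer, so a strict majority of correct chains is required.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        votes = sum(run_chain(p_improve, p_degrade, steps, rng)
                    for _ in range(n_chains))
        if 2 * votes > n_chains:
            wins += 1
    return wins / n_trials


if __name__ == "__main__":
    p_up, p_down = 0.06, 0.04  # illustrative values, not from the paper
    print("stationary P(correct):", stationary_correct_prob(p_up, p_down))
    print("majority-vote accuracy:",
          majority_vote_accuracy(p_up, p_down, steps=100,
                                 n_chains=31, n_trials=300))
```

With the illustrative values above, a single chain settles at a 0.6 probability of being correct, and voting across 31 chains should yield a noticeably higher estimate. Reversing the two probabilities pushes the stationary value below 1/2, in which case voting no longer helps, which is why the condition that improvement outweighs degradation matters.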