nabla-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

Abstract

Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities. However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose nabla-Reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Its core component, Differentiable Textual Optimization (DTO), uses gradient signals from both the LLM's likelihood and a reward model to refine textual representations. nabla-Reasoner further incorporates rejection sampling and an acceleration design to make decoding more robust and faster. Theoretically, we show that performing inference-time gradient descent in the sample space to maximize reward is dual to aligning an LLM policy via KL-regularized reinforcement learning. Empirically, nabla-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark, while reducing the number of model calls by approximately 10-40% compared to strong baselines. Overall, our work introduces a paradigm shift from zeroth-order search to first-order optimization at test time, offering a cost-effective path to amplify LLM reasoning.
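To make the link between first-order optimization over logits and KL-regularized alignment concrete, here is a minimal toy sketch. It is not the paper's DTO algorithm, whose details the abstract does not give, and it does not reproduce the paper's theorem. It assumes a single categorical "policy" over four tokens and runs gradient ascent on the objective E_p[r] - beta * KL(p || pi_ref), then compares the result with the standard closed-form optimum p* proportional to pi_ref * exp(r / beta). The function names, reward values, reference probabilities, beta, and learning rate are all hypothetical choices made for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def objective_grad(z, reward, ref_logp, beta):
    """Gradient of J(z) = E_p[reward] - beta * KL(p || ref) w.r.t. logits z,
    where p = softmax(z). Constant terms cancel under the softmax Jacobian."""
    p = softmax(z)
    # Partial derivative of J with respect to each probability p_k (up to a constant).
    g = [reward[k] + beta * (ref_logp[k] - math.log(p[k])) for k in range(len(z))]
    mean_g = sum(pk * gk for pk, gk in zip(p, g))
    # Chain rule through softmax: dJ/dz_k = p_k * (g_k - E_p[g]).
    return [pk * (gk - mean_g) for pk, gk in zip(p, g)]


def refine_logits(reward, ref_logp, beta=1.0, lr=0.5, steps=500):
    """First-order refinement of the logits, starting from the reference policy."""
    z = list(ref_logp)
    for _ in range(steps):
        grad = objective_grad(z, reward, ref_logp, beta)
        z = [zk + lr * gk for zk, gk in zip(z, grad)]  # gradient ascent step
    return softmax(z)


def closed_form(reward, ref_logp, beta=1.0):
    """Closed-form optimum of the KL-regularized objective: ref * exp(reward / beta)."""
    return softmax([lp + r / beta for lp, r in zip(ref_logp, reward)])


if __name__ == "__main__":
    # Hypothetical toy values: a reference policy and a per-token reward.
    reward = [0.0, 1.0, 0.5, 2.0]
    ref_logp = [math.log(q) for q in [0.4, 0.3, 0.2, 0.1]]
    print("gradient ascent:", refine_logits(reward, ref_logp))
    print("closed form:    ", closed_form(reward, ref_logp))
```

In this toy setting the two printed distributions should agree closely, which illustrates why a first-order update on logits can stand in for solving the regularized alignment problem directly. A real system would replace the fixed reward vector and reference log-probabilities with a reward model and an LLM evaluated on full sequences.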
