Scaling Speculative Decoding with Lookahead Reasoning

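Abstract

Reasoning models excel by generating long chains of thought, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped: the chance that an entire gamma-token guess is correct falls exponentially as gamma grows. Allocating more compute to longer token drafts therefore runs into an algorithmic ceiling, which keeps the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step by step, and each step needs only to be semantically correct, not an exact token match. In Lookahead Reasoning, a lightweight draft model proposes several future steps, the target model expands each proposal in one batched pass, and a verifier keeps the semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show that Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning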
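To make the token-level ceiling concrete, the following is a minimal sketch, not code from the paper or its repository. It assumes each drafted token is accepted independently with a constant probability `alpha`. That modeling assumption and the function names are illustrative. Under it, the chance that a whole gamma-token draft is accepted is `alpha ** gamma`, and the expected number of tokens produced per target-model pass is the geometric sum of `alpha ** i` for i from 0 to gamma, which can never exceed `1 / (1 - alpha)` however long the draft is.

```python
# Illustrative sketch only: assumes i.i.d. per-token acceptance with rate alpha.
# Function names are hypothetical and do not come from the paper's code.


def prob_all_accepted(alpha: float, gamma: int) -> float:
    """Probability that every one of the gamma drafted tokens is accepted."""
    return alpha ** gamma


def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    This is the geometric sum 1 + alpha + ... + alpha**gamma: the accepted
    prefix of the draft plus the one token the target model always contributes.
    """
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    alpha = 0.8  # assumed acceptance rate, chosen only for illustration
    for gamma in (1, 2, 4, 8, 16, 32):
        p = prob_all_accepted(alpha, gamma)
        e = expected_tokens_per_pass(alpha, gamma)
        print(f"gamma={gamma:2d}  P(all accepted)={p:.4f}  E[tokens/pass]={e:.3f}")
    print(f"ceiling 1/(1-alpha) = {1.0 / (1.0 - alpha):.3f}")
```

Running the script shows the expected tokens per pass flattening toward the ceiling as gamma grows. This is the saturation that motivates adding a second, step-level layer of parallelism rather than drafting ever longer token sequences.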

