
Scaling Speculative Decoding with Lookahead Reasoning

June 24, 2025
Authors: Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang
cs.AI

Abstract

Reasoning models excel by generating long chains of thought, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped because the chance that an entire gamma-token guess is correct falls exponentially as gamma grows. Allocating more compute to longer token drafts therefore hits an algorithmic ceiling, making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step by step, and each step need only be semantically correct rather than an exact token match. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps the semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show, both theoretically and empirically, that Lookahead Reasoning lifts the peak speedup of SD. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning.
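The "algorithmic ceiling" mentioned above matches the standard analysis of token-level speculative decoding: if each drafted token is accepted independently with probability alpha, the expected number of tokens emitted per target-model pass from a gamma-token draft is bounded no matter how large gamma grows. A sketch of that well-known bound (notation ours, not taken from this paper):

```latex
% Expected tokens accepted per target pass, assuming each of the \gamma
% drafted tokens is accepted i.i.d. with probability \alpha:
\mathbb{E}[\text{tokens per pass}]
  = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}
  \;\xrightarrow{\;\gamma \to \infty\;}\; \frac{1}{1 - \alpha}
```

Extra draft length thus buys exponentially diminishing returns, which is the cap the step-level layer of parallelism is designed to lift.

To make the step-level loop concrete, below is a minimal Python sketch of the draft-expand-verify cycle the abstract describes. All names here (draft_step, target_step, accepts) are hypothetical stand-ins rather than the authors' API, and the target's batched pass is written as a plain loop for clarity; see the linked repository for the actual implementation.

```python
from typing import Callable, List

def lookahead_reasoning(
    prompt: str,
    draft_step: Callable[[str], str],     # cheap model: propose the next reasoning step
    target_step: Callable[[str], str],    # strong model: generate its own next step
    accepts: Callable[[str, str], bool],  # verifier: is the draft semantically correct?
    num_draft_steps: int = 4,
    max_steps: int = 16,
) -> List[str]:
    """Step-level speculation: the draft model proposes several future steps,
    the target expands every speculated prefix (one batched pass in the real
    system), and a verifier keeps drafts until the first semantic mismatch."""
    steps: List[str] = []
    while len(steps) < max_steps:
        context = prompt + "".join(steps)

        # 1) Draft model speculates several future steps sequentially.
        drafts: List[str] = []
        ctx = context
        for _ in range(num_draft_steps):
            d = draft_step(ctx)
            drafts.append(d)
            ctx += d

        # 2) Target model produces its own next step for each speculated
        #    prefix (batched in the paper; a loop here for readability).
        targets: List[str] = []
        ctx = context
        for d in drafts:
            targets.append(target_step(ctx))
            ctx += d

        # 3) Keep draft steps while the verifier accepts them; the target's
        #    own step replaces the first rejected draft, then we re-draft.
        for d, t in zip(drafts, targets):
            if accepts(d, t):
                steps.append(d)
            else:
                steps.append(t)  # target's regeneration replaces the bad draft
                break
    return steps

# Toy usage with trivial stand-ins; real use would wrap actual LLM calls and
# stop on an end-of-reasoning marker rather than a fixed max_steps.
if __name__ == "__main__":
    def toy(tag: str) -> Callable[[str], str]:
        return lambda ctx: f"<{tag}:{ctx.count('>')}> "

    out = lookahead_reasoning(
        "Q: 2+2?\n",
        draft_step=toy("d"),
        target_step=toy("t"),
        accepts=lambda d, t: d.split(":")[1] == t.split(":")[1],
    )
    print("".join(out))
```

Because accepted draft steps never wait on the target, each batched target pass can commit several reasoning steps at once, and token-level SD can still run inside every individual step, so the two speedups compose multiplicatively as the abstract states.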