Scaling Speculative Decoding with Lookahead Reasoning

Abstract

Reasoning models excel by generating long chains of thought, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped because the chance that an entire gamma-token guess is correct falls exponentially as gamma grows. Allocating more compute to longer token drafts therefore hits an algorithmic ceiling, which keeps the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step by step, and each step needs only to be semantically correct, not an exact token match. In Lookahead Reasoning, a lightweight draft model proposes several future steps, the target model expands each proposal in one batched pass, and a verifier keeps the semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show that Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
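The ceiling on token-level SD can be illustrated with a small numerical sketch. It assumes each drafted token is accepted independently with the same probability alpha. This is a simplifying assumption introduced here for illustration, not a model taken from the abstract, and the function names are hypothetical. Under that assumption, the chance that a whole gamma-token draft is accepted is alpha^gamma, and the expected number of tokens emitted per target verification pass is the geometric sum 1 + alpha + ... + alpha^gamma, which can never exceed 1 / (1 - alpha) however long the draft is.

```python
# Illustrative sketch only: assumes i.i.d. per-token acceptance with rate
# alpha, which is a simplification and not a claim about the paper's analysis.


def prob_full_draft_accepted(alpha: float, gamma: int) -> float:
    """Probability that all gamma drafted tokens are accepted."""
    return alpha ** gamma


def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification pass.

    Equals the geometric sum 1 + alpha + ... + alpha**gamma: the accepted
    prefix of the draft plus the one token the target always contributes.
    """
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def ceiling(alpha: float) -> float:
    """Limit of expected_tokens_per_pass as gamma grows (requires alpha < 1)."""
    return 1.0 / (1.0 - alpha)


if __name__ == "__main__":
    alpha = 0.8  # hypothetical acceptance rate
    for gamma in (1, 2, 4, 8, 16, 32):
        p_all = prob_full_draft_accepted(alpha, gamma)
        tokens = expected_tokens_per_pass(alpha, gamma)
        print(f"gamma={gamma:2d}  P(all accepted)={p_all:.4f}  "
              f"E[tokens/pass]={tokens:.3f}  ceiling={ceiling(alpha):.3f}")
```

Running the script shows the expected tokens per pass flattening toward the ceiling as gamma grows, while the draft cost keeps rising. That flattening is the diminishing return a second, step-level layer of parallelism is meant to get around.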

