SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

Abstract

Recent advances in inference-time compute have significantly improved performance on complex tasks by generating long chains of thought (CoTs) using Large Reasoning Models (LRMs). However, this improved accuracy comes at the cost of high inference latency, due to the length of the generated reasoning sequences and the autoregressive nature of decoding. Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds, is highly tolerant of approximations: complex tasks are typically broken down into simpler steps, each of which brings utility based on the semantic insight it provides for downstream steps rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps and reserving the costly base model only to assess (and potentially correct) the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of thinking tokens while preserving final-answer accuracy is complementary to prior speculation techniques, most notably speculative decoding, which demands token-level equivalence at each step. Across a variety of reasoning benchmarks, SpecReason achieves a 1.5-2.5× speedup over vanilla LRM inference while improving accuracy by 1.0-9.9%. Compared to speculative decoding without SpecReason, their combination yields an additional 19.4-44.2% latency reduction. We open-source SpecReason at https://github.com/ruipeterpan/specreason.
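The speculate-then-assess loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (draft_step, base_step, score_step, is_final), the numeric acceptance threshold, and the toy arithmetic stand-ins for the two models are all assumptions made for illustration. The abstract states only that a lightweight model proposes intermediate steps and the base model assesses and, if needed, corrects them.

```python
from typing import Callable, List, Tuple


def speculative_reasoning(
    question: str,
    draft_step: Callable[[str, List[str]], str],         # hypothetical: lightweight model proposes a step
    base_step: Callable[[str, List[str]], str],          # hypothetical: base model writes a step itself
    score_step: Callable[[str, List[str], str], float],  # hypothetical: base model rates a proposed step
    is_final: Callable[[str], bool],                      # hypothetical: detects the final-answer step
    threshold: float = 0.7,                               # assumed acceptance cutoff, not from the source
    max_steps: int = 32,
) -> Tuple[List[str], int]:
    """Return the reasoning steps and how many came from the lightweight model."""
    steps: List[str] = []
    accepted = 0
    for _ in range(max_steps):
        # 1. The lightweight model speculates the next reasoning step.
        candidate = draft_step(question, steps)
        # 2. The base model only assesses the speculated step.
        if score_step(question, steps, candidate) >= threshold:
            accepted += 1
        else:
            # 3. Rejected: fall back to the base model to produce this step.
            candidate = base_step(question, steps)
        steps.append(candidate)
        if is_final(candidate):
            break
    return steps, accepted


# Toy stand-ins for the two models (illustrative only).
draft_outputs = ["3 * 4 = 12", "2 + 12 = 15", "answer: 14"]  # second step is wrong
base_outputs = ["3 * 4 = 12", "2 + 12 = 14", "answer: 14"]

steps, accepted = speculative_reasoning(
    question="What is 2 + 3 * 4?",
    draft_step=lambda q, s: draft_outputs[len(s)],
    base_step=lambda q, s: base_outputs[len(s)],
    score_step=lambda q, s, c: 0.9 if c == base_outputs[len(s)] else 0.1,
    is_final=lambda step: step.startswith("answer"),
)
print(steps, accepted)
```

In this toy run the flawed second step is rejected and regenerated by the base stand-in, while the other two steps are accepted from the lightweight stand-in. Acceptance here is judged per step on an assumed quality score rather than on token-level agreement, which is the distinction the abstract draws with speculative decoding.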
