Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Abstract

Speculative decoding (SD) accelerates large language model inference through a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, but these trees incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. Dynamic-depth pruning can reduce this latency by removing marginal branches, yet it also discards potentially valid candidates, which keeps the acceptance rate below the upper bound set by dense trees. In this paper, we identify a critical opportunity in resource allocation: moving from dense to pruned drafting frees up a significant computational budget. To break this Pareto tradeoff between drafting latency and acceptance rate, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. Using a sequential "prune-then-graft" mechanism, Graft attaches highly predictive retrieved tokens at the positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41× speedup, and on the large-scale Qwen3-235B model it improves average speedup over EAGLE-3 by up to 21.8%. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.
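The prune-then-graft idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the tree is stored, how branches are scored, or where retrieved tokens come from. The sketch assumes a draft tree stored as a mapping from token paths to cumulative draft probabilities, a fixed node budget, and a simple n-gram lookup over the existing context as the retrieval source. The names `prune`, `retrieve`, and `graft` are hypothetical, and the verification step by the target model (which is what makes speculative decoding lossless) is omitted.

```python
from typing import Optional

# A draft tree maps a token path (from the root) to a score.
# Assumption: the score is a cumulative path probability, so a child never
# outscores its parent. Grafted nodes carry no draft score (None).
Path = tuple[int, ...]


def prune(tree: dict[Path, float], keep: int) -> dict[Path, float]:
    """Keep the `keep` highest-scoring nodes of a dense draft tree."""
    # Ties are broken by depth so that a parent always ranks before its child,
    # which keeps the surviving node set closed under ancestors.
    ranked = sorted(tree.items(), key=lambda kv: (-kv[1], len(kv[0])))
    return dict(ranked[:keep])


def retrieve(context: list[int], ngram: int, max_len: int) -> list[int]:
    """Return the tokens that followed the most recent earlier occurrence
    of the last `ngram` context tokens (empty list if there is none)."""
    if len(context) <= ngram:
        return []
    suffix = context[-ngram:]
    # Scan backwards, skipping the suffix's own position at the end.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == suffix:
            return context[start + ngram:start + ngram + max_len]
    return []


def graft(pruned: dict[Path, float], chain: list[int],
          budget: int) -> dict[Path, Optional[float]]:
    """Attach a retrieved token chain to the pruned tree without exceeding
    `budget` nodes in total."""
    grafted: dict[Path, Optional[float]] = dict(pruned)
    for depth in range(1, len(chain) + 1):
        if len(grafted) >= budget:
            break
        # Existing prefixes are reused; only missing nodes consume budget.
        grafted.setdefault(tuple(chain[:depth]), None)
    return grafted


# Toy example: a dense tree of 6 nodes, pruned to 3, then refilled by retrieval.
dense_tree = {
    (7,): 0.60, (1,): 0.30,
    (7, 2): 0.35, (7, 8): 0.20, (1, 4): 0.10,
    (7, 2, 3): 0.15,
}
context = [5, 6, 7, 8, 9, 5, 6]

pruned_tree = prune(dense_tree, keep=3)
chain = retrieve(context, ngram=2, max_len=3)
hybrid_tree = graft(pruned_tree, chain, budget=len(dense_tree))
```

In this toy case, pruning drops the low-probability branch `(7, 8)`, and the retrieved chain restores it and extends it one token deeper, while the hybrid tree still holds fewer nodes than the dense one. Only the pruned nodes require draft-model computation here; the grafted nodes come from a lookup, which is the sense in which retrieval is cheap in this sketch.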