SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Abstract

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in a single forward pass of the target model. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, which makes drafting a non-trivial share of per-iteration latency. Parallel drafters reduce drafter calls by predicting multiple future positions in one forward pass, but each position is predicted without seeing the others, so they produce paths that the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward pass produces K dependent positions, which we call a block, and the draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence so that later draft positions remain accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block and inherits that position's hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree and allocates per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier position is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and that cost-aware adaptation extends this lead to 11-19%.
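
The valid-prefix mask described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `valid_prefix_mask` and `masked_mean_loss`, the use of greedy token equality to decide whether a position is "wrong", and the normalization by the number of kept positions are all assumptions made for illustration. The abstract states only that the loss at later positions is dropped once an earlier position is wrong, so the sketch keeps the loss at the first wrong position itself.

```python
from typing import List


def valid_prefix_mask(predicted: List[int], target: List[int]) -> List[int]:
    """Return a 0/1 loss mask for one block of K draft positions.

    Position i keeps its loss (1) only if every earlier position j < i was
    predicted correctly. The first wrong position keeps its loss; all
    positions after it are dropped (0). "Correct" here means exact token
    equality, which is an illustrative assumption.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    mask = []
    prefix_ok = True
    for pred_tok, tgt_tok in zip(predicted, target):
        # The mask for this position depends only on earlier positions.
        mask.append(1 if prefix_ok else 0)
        if pred_tok != tgt_tok:
            prefix_ok = False
    return mask


def masked_mean_loss(losses: List[float], mask: List[int]) -> float:
    """Average the per-position losses over kept positions only.

    Normalizing by the number of kept positions is an assumption; the
    source does not say how the masked loss is reduced.
    """
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(losses, mask)) / kept


# Hypothetical block with K = 4: the drafter's second token is wrong.
predicted = [5, 7, 9, 2]
target = [5, 8, 9, 2]
mask = valid_prefix_mask(predicted, target)  # [1, 1, 0, 0]
loss = masked_mean_loss([0.1, 2.0, 0.3, 0.4], mask)
```

In this example, positions 3 and 4 match the target tokens but are still masked out, because at inference the drafter would have reached them through a prefix the verifier rejects.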