Batch Speculative Decoding Done Right

Abstract

Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for production serving, but it introduces the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, which breaks right-alignment and corrupts position IDs, attention masks, and KV-cache state. We show that several existing batch implementations violate output equivalence, the fundamental requirement that speculative decoding produce token sequences identical to those of standard autoregressive generation. These violations stem precisely from improper handling of the ragged tensor problem. In response, we (1) characterize the synchronization requirements that guarantee correctness, (2) present EQSPEC, a correctness-first batch speculative decoding method that exposes realignment as consuming 40% of overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences and dynamically forms same-length groups to reduce the realignment overhead while preserving per-sequence speculative speedups. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to 3× throughput improvement at batch size 8 compared to batch size 1, with efficient scaling through batch size 8, while maintaining 95% output equivalence. Our method requires no custom kernels and integrates cleanly with existing inference stacks. Our code is available at https://github.com/eBay/spec_dec.
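To make the ragged tensor problem concrete, the following is a minimal sketch in plain Python, not the EQSPEC or EXSPEC implementation. It assumes left padding so that the last tokens of all sequences stay aligned. The names `PAD`, `realign_left_padded`, and `group_by_length`, and the acceptance counts, are hypothetical and chosen only for illustration. After verification, each sequence keeps a different number of draft tokens, so padding, attention masks, and position IDs all have to be rebuilt; grouping sequences of equal length avoids that step within each group.

```python
from collections import defaultdict

PAD = 0  # hypothetical padding token id (assumption for this sketch)


def realign_left_padded(sequences):
    """Left-pad variable-length sequences so their last tokens line up."""
    width = max(len(s) for s in sequences)
    input_ids, attention_mask, position_ids = [], [], []
    for seq in sequences:
        pad = width - len(seq)
        input_ids.append([PAD] * pad + list(seq))
        attention_mask.append([0] * pad + [1] * len(seq))
        # Positions count real tokens only; padded slots get a placeholder 0.
        position_ids.append([0] * pad + list(range(len(seq))))
    return input_ids, attention_mask, position_ids


def group_by_length(sequences):
    """Map each length to the indices of the sequences that have it."""
    groups = defaultdict(list)
    for idx, seq in enumerate(sequences):
        groups[len(seq)].append(idx)
    return dict(groups)


# Two aligned sequences, each with three proposed draft tokens.
prefixes = [[11, 12, 13], [21, 22, 23]]
drafts = [[14, 15, 16], [24, 25, 26]]
accepted = [3, 1]  # hypothetical per-sequence acceptance counts

# Keep only the accepted draft tokens. The extra token that the target
# model contributes after verification is omitted for brevity.
updated = [p + d[:k] for p, d, k in zip(prefixes, drafts, accepted)]

# The batch is now ragged (lengths 6 and 4) and must be realigned.
ids, mask, pos = realign_left_padded(updated)
groups = group_by_length(updated)
```

In this toy batch the second sequence ends up two tokens shorter than the first, so it needs two padding slots and shifted position IDs. A real implementation would also have to trim or shift the KV cache consistently, which this sketch does not attempt.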
