DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

Abstract

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. Nevertheless, the draft model introduces additional computational overhead, becoming a performance bottleneck and increasing the time to first token (TTFT). Previous approaches to mitigating draft model overhead have relied primarily on heuristics and have generally failed to match the quality of draft language models. To address these challenges, we propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively, enabling parallel decoding while preserving draft quality. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality. Extensive experiments across seven tasks show that DuoDecoding achieves up to 2.61x speedup in generation latency, while reducing TTFT to 83% of that in conventional speculative decoding. The code is available at https://github.com/KaiLv69/DuoDecoding.
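To make the idea of a hardware-aware draft budget concrete, the following is a minimal sketch and not the authors' implementation. It assumes the budget can be approximated as the number of draft tokens the CPU produces during one GPU forward pass of the target model, so that neither device waits long for the other. The function names `draft_budget` and `idle_time_ms`, and the latency figures in the usage note, are hypothetical and chosen only for illustration.

```python
import math


def draft_budget(target_step_ms: float, draft_token_ms: float) -> int:
    """Hypothetical helper: draft tokens the CPU can produce during one
    GPU forward pass of the target model (assumed rule, not from the paper)."""
    if target_step_ms <= 0 or draft_token_ms <= 0:
        raise ValueError("latencies must be positive")
    # Always draft at least one token so decoding makes progress.
    return max(1, math.floor(target_step_ms / draft_token_ms))


def idle_time_ms(budget: int, target_step_ms: float, draft_token_ms: float) -> float:
    """Time one device spends waiting for the other in a single round."""
    return abs(target_step_ms - budget * draft_token_ms)
```

For example, with an illustrative target step of 30 ms on the GPU and 8 ms per draft token on the CPU, `draft_budget(30.0, 8.0)` gives a budget of 3, leaving about 6 ms of idle time per round under these assumptions.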
