Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge

Abstract

Large language models (LLMs) suffer from low efficiency because of a mismatch between the requirements of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded into the GPU cache through its limited memory bandwidth for computation, but only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer instead of computation. Recently, parallel decoding, a type of speculative decoding algorithm, has become more popular and has demonstrated impressive efficiency improvements in generation. It adds extra decoding heads to large models, enabling them to predict multiple subsequent tokens simultaneously and to verify these candidate continuations in a single decoding step. However, this approach deviates from the next-token-prediction training objective used during pre-training, resulting in a low hit rate for candidate tokens. In this paper, we propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process. This enhancement improves the hit rate of speculators and thus boosts overall efficiency. Clover transmits sequential knowledge from pre-speculated tokens via the Regressive Connection, then employs an Attention Decoder to integrate these speculated tokens. Additionally, Clover incorporates an Augmenting Block that modifies the hidden states to better align with the purpose of speculative generation rather than next-token prediction. Experimental results demonstrate that Clover outperforms the baseline by up to 91% on Baichuan-Small and up to 146% on Baichuan-Large, and exceeds the previously top-performing method, Medusa, by up to 37% on Baichuan-Small and up to 57% on Baichuan-Large.
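
To illustrate why the hit rate of candidate tokens determines the speedup, the verification step can be sketched as below. This is a minimal sketch of generic greedy acceptance, not Clover's or Medusa's actual implementation. The names verify_greedy and toy_model are hypothetical, and the loop only mimics the acceptance rule; a real system would score every candidate position in one batched forward pass of the large model.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    next_token: Callable[[Sequence[int]], int],
    prefix: Sequence[int],
    candidates: Sequence[int],
) -> List[int]:
    """Return the tokens emitted by one verification step.

    `next_token` is a hypothetical stand-in for the large model's greedy
    next-token choice given a context (an assumption for illustration).
    """
    accepted: List[int] = []
    context = list(prefix)
    for cand in candidates:
        target = next_token(context)
        if cand != target:
            # First mismatch: emit the model's own token and stop.
            accepted.append(target)
            return accepted
        # Hit: the speculated token equals the model's greedy choice.
        accepted.append(cand)
        context.append(cand)
    # Every candidate matched, so the step also yields one extra token.
    accepted.append(next_token(context))
    return accepted


def toy_model(context: Sequence[int]) -> int:
    # Toy "model": the next token is always (last token + 1) mod 10.
    return (context[-1] + 1) % 10


# Two of the three speculated tokens are hits; the third is corrected.
print(verify_greedy(toy_model, [1, 2], [3, 4, 9]))  # prints [3, 4, 5]
```

In this sketch each step emits at least one token and at most len(candidates) + 1, so a higher hit rate directly increases the number of tokens produced per load of the model parameters.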
