Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
May 1, 2024
Authors: Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui
cs.AI
Abstract
Large language models (LLMs) suffer from low efficiency due to the mismatch between the requirements of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded into the GPU cache through its limited memory bandwidth for computation, but only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer rather than computation. Recently, parallel decoding, a type of speculative decoding algorithm, has become increasingly popular and has demonstrated impressive efficiency improvements in generation. It introduces extra decoding heads into large models, enabling them to predict multiple subsequent tokens simultaneously and to verify these candidate continuations in a single decoding step. However, this approach deviates from the next-token-prediction training objective used during pre-training, resulting in a low hit rate for candidate tokens. In this paper, we propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process. This enhancement improves the hit rate of speculators and thus boosts overall efficiency. Clover transmits sequential knowledge from pre-speculated tokens via the Regressive Connection, then employs an Attention Decoder to integrate these speculated tokens. Additionally, Clover incorporates an Augmenting Block that modifies the hidden states to better align with the purpose of speculative generation rather than next-token prediction. Experimental results demonstrate that Clover outperforms the baseline by up to 91% on Baichuan-Small and by up to 146% on Baichuan-Large, and exceeds the previously top-performing method, Medusa, by up to 37% on Baichuan-Small and 57% on Baichuan-Large.
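The "verify these candidate continuations in a single decoding step" idea can be sketched with a toy greedy-verification routine. This is a minimal illustration of the acceptance rule shared by speculative and parallel decoding, not the paper's implementation; `target_next_token` and the toy model below are hypothetical stand-ins. (A real implementation scores the context plus all drafts in one batched forward pass; this sketch queries the target sequentially for clarity.)

```python
# Hypothetical sketch of the verification step in speculative decoding:
# keep the longest prefix of draft tokens that matches what the target
# model would itself have generated greedily.

def greedy_verify(target_next_token, context, drafts):
    """Return accepted tokens: each draft is kept only if the target
    model (queried on the context so far) agrees with it; the first
    mismatch is replaced by the target's own token, then we stop."""
    accepted = []
    for d in drafts:
        t = target_next_token(context + accepted)
        if t == d:
            accepted.append(d)   # draft hit: an extra token "for free"
        else:
            accepted.append(t)   # miss: fall back to the target's token
            break
    return accepted

# Toy "target model": continues an integer sequence by adding 1.
toy = lambda ctx: ctx[-1] + 1

print(greedy_verify(toy, [1, 2, 3], [4, 5, 9, 10]))  # → [4, 5, 6]
```

A higher hit rate on the drafts means more tokens accepted per decoding step, which is exactly the quantity Clover's sequential knowledge aims to improve.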
AI-Generated Summary
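The contrast between Medusa-style heads and Clover's Regressive Connection can be illustrated with a toy numeric sketch. All weights, embeddings, and the blending rule below are invented for illustration and are not the paper's architecture: the only point carried over from the abstract is that Clover's head k also consumes the token speculated by head k-1, so sequential knowledge flows between heads, whereas Medusa's heads predict independently from the same hidden state.

```python
# Toy contrast (illustrative assumptions, not the released Clover code):
# Medusa-style heads all read the same hidden state; a "regressive"
# variant feeds each head's chosen token into the next head's input.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

VOCAB = 4
# Tiny hand-made per-head weights: HEADS[k][t] scores vocab token t.
HEADS = [
    [[1, 0], [0, 1], [1, 1], [0, 0]],   # speculative head 0
    [[0, 1], [1, 0], [1, 1], [0, 0]],   # speculative head 1
]
EMBED = [[1, 0], [0, 1], [-3, 0], [-1, -1]]  # toy token embeddings

def medusa_heads(h):
    # Each head predicts independently from the same hidden state h.
    return [max(range(VOCAB), key=lambda t: dot(h, W[t])) for W in HEADS]

def regressive_heads(h):
    # Head k sees h blended with the embedding of head k-1's token,
    # a stand-in for Clover's Regressive Connection.
    out, state = [], h
    for W in HEADS:
        tok = max(range(VOCAB), key=lambda t: dot(state, W[t]))
        out.append(tok)
        state = [s + e for s, e in zip(state, EMBED[tok])]
    return out

h = [2.0, 1.0]  # stand-in for the last hidden state
print(medusa_heads(h), regressive_heads(h))  # → [2, 2] [2, 0]
```

In this toy setup the second head's prediction changes once it conditions on the first head's token, which is the kind of inter-head dependence the abstract credits for Clover's higher hit rate.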