Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
May 1, 2024
作者: Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui
cs.AI
Abstract
Large language models (LLMs) suffer from low efficiency due to the mismatch
between the requirements of auto-regressive decoding and the design of most
contemporary GPUs. Specifically, billions to trillions of parameters must be
loaded into the GPU cache through its limited memory bandwidth for computation,
but only a small batch of tokens is actually computed. Consequently, the GPU
spends most of its time on memory transfer rather than computation. Recently,
parallel decoding, a type of speculative decoding algorithm, has become more
popular and has demonstrated impressive efficiency improvements in generation.
It introduces extra decoding heads to large models, enabling them to predict
multiple subsequent tokens simultaneously and to verify these candidate
continuations in a single decoding step. However, this approach deviates from
the next-token-prediction training objective used during pre-training,
resulting in a low hit rate for candidate tokens. In this paper, we propose a
new speculative decoding algorithm, Clover, which integrates sequential
knowledge into the parallel decoding process. This enhancement improves the hit
rate of the speculators and thus boosts overall efficiency. Clover transmits
sequential knowledge from pre-speculated tokens via a Regressive Connection,
then employs an Attention Decoder to integrate these speculated tokens.
Additionally, Clover incorporates an Augmenting Block that modifies the hidden
states to better align with the purpose of speculative generation rather than
next-token prediction. Experimental results demonstrate that Clover outperforms
the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, and
exceeds the previously top-performing method, Medusa, by up to 37% on
Baichuan-Small and 57% on Baichuan-Large.
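To see why small-batch auto-regressive decoding is memory-bound, consider a rough back-of-the-envelope estimate. The hardware figures below are illustrative assumptions (an A100-class GPU and a hypothetical 70B-parameter model), not numbers from the paper:

```python
# Rough illustration of the memory-bandwidth bottleneck in decoding.
# All figures are illustrative assumptions, not values from the paper.
params = 70e9            # hypothetical 70B-parameter model
bytes_per_param = 2      # fp16 weights
bandwidth = 2e12         # ~2 TB/s HBM bandwidth (A100-class GPU)
peak_flops = 300e12      # ~300 TFLOP/s usable fp16 throughput

weight_bytes = params * bytes_per_param
t_memory = weight_bytes / bandwidth    # time to stream all weights once per step
t_compute = 2 * params / peak_flops    # ~2 FLOPs per parameter per token

print(f"memory transfer per step: {t_memory * 1e3:.1f} ms")   # ~70 ms
print(f"compute per token:        {t_compute * 1e3:.2f} ms")  # ~0.5 ms
```

Under these assumptions, streaming the weights dominates the step time by two orders of magnitude, which is why verifying several speculated tokens in one step (amortizing a single weight pass over multiple tokens) pays off.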
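The abstract names three components: a Regressive Connection that feeds each speculator the token drafted by the previous one, an Attention Decoder that fuses that token with the LLM's hidden states, and an Augmenting Block that adapts those hidden states for speculation. For a concrete picture, here is a minimal PyTorch sketch of how such a speculator could be wired; the class name, the single-layer attention, greedy drafting, and all dimensions are assumptions for illustration, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class CloverSpeculatorSketch(nn.Module):
    """Minimal sketch of a Clover-style speculator (illustrative, not the
    paper's exact architecture). Each head conditions on the token drafted
    by the previous head (Regressive Connection), fuses it with the LLM's
    hidden states via attention (Attention Decoder), after an extra
    transformer layer has adapted those states (Augmenting Block)."""

    def __init__(self, hidden: int, vocab: int, n_heads: int = 3):
        super().__init__()
        # Augmenting Block: adapts hidden states from next-token prediction
        # to speculative generation (a single transformer layer here).
        self.augment = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.lm_heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(n_heads))

    def forward(self, hidden_states: torch.Tensor, last_token: torch.Tensor):
        # hidden_states: (batch, seq, hidden) from the target LLM's last layer
        # last_token:    (batch,) token accepted at the current decoding step
        h = self.augment(hidden_states)             # Augmenting Block
        prev = self.embed(last_token).unsqueeze(1)  # (batch, 1, hidden)
        drafts = []
        for head in self.lm_heads:
            # Attention Decoder: query = previously drafted token, keys and
            # values = augmented hidden states. Sequential knowledge flows
            # head-to-head through `prev` (the Regressive Connection).
            fused, _ = self.attn(prev, h, h)
            logits = head(fused.squeeze(1))         # (batch, vocab)
            next_tok = logits.argmax(dim=-1)        # greedy draft token
            drafts.append(next_tok)
            prev = self.embed(next_tok).unsqueeze(1)
        return torch.stack(drafts, dim=1)           # (batch, n_heads)

if __name__ == "__main__":
    spec = CloverSpeculatorSketch(hidden=768, vocab=32000)
    h = torch.randn(2, 16, 768)              # stand-in LLM hidden states
    print(spec(h, torch.tensor([5, 7])).shape)  # torch.Size([2, 3])
```

The draft tokens produced this way would then be verified against the target LLM in a single decoding step, as the abstract describes for parallel decoding in general.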