Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
May 1, 2024
作者: Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui
cs.AI
Abstract
Large language models (LLMs) suffer from low efficiency due to the mismatch
between the requirements of auto-regressive decoding and the design of most
contemporary GPUs. Specifically, billions to trillions of parameters must be
loaded into the GPU cache through its limited memory bandwidth for computation,
but only a small batch of tokens is actually computed. Consequently, the GPU
spends most of its time on memory transfer rather than computation. Recently,
parallel decoding, a type of speculative decoding algorithm, has become more
popular and has demonstrated impressive efficiency improvements in generation.
It introduces extra decoding heads to large models, enabling them to predict
multiple subsequent tokens simultaneously and to verify these candidate
continuations in a single decoding step. However, this approach deviates from
the next-token-prediction training objective used during pre-training,
resulting in a low hit rate for candidate tokens. In this paper, we propose a
new speculative decoding algorithm, Clover, which integrates sequential
knowledge into the parallel decoding process. This enhancement improves the hit
rate of the speculators and thus boosts overall efficiency. Clover transmits
sequential knowledge from pre-speculated tokens via a Regressive Connection,
then employs an Attention Decoder to integrate these speculated tokens.
Additionally, Clover incorporates an Augmenting Block that modifies the hidden
states to better align with the purpose of speculative generation rather than
next-token prediction. Experimental results demonstrate that Clover outperforms
the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, and
exceeds the previously top-performing method, Medusa, by up to 37% on
Baichuan-Small and 57% on Baichuan-Large.
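To see why small-batch auto-regressive decoding is memory-bound, consider a rough back-of-the-envelope estimate. The hardware figures below are illustrative assumptions (an A100-class GPU and a hypothetical 70B-parameter model), not numbers from the paper:

```python
# Rough illustration of the memory-bandwidth bottleneck in decoding.
# All figures are illustrative assumptions, not values from the paper.
params = 70e9            # hypothetical 70B-parameter model
bytes_per_param = 2      # fp16 weights
bandwidth = 2e12         # ~2 TB/s HBM bandwidth (A100-class GPU)
peak_flops = 300e12      # ~300 TFLOP/s usable fp16 throughput

weight_bytes = params * bytes_per_param
t_memory = weight_bytes / bandwidth    # time to stream all weights once per step
t_compute = 2 * params / peak_flops    # ~2 FLOPs per parameter per token

print(f"memory transfer per step: {t_memory * 1e3:.1f} ms")   # ~70 ms
print(f"compute per token:        {t_compute * 1e3:.2f} ms")  # ~0.5 ms
```

Under these assumptions, streaming the weights dominates the step time by two orders of magnitude, which is why verifying several speculated tokens in one step (amortizing a single weight pass over multiple tokens) pays off.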
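The abstract names three components: a Regressive Connection that feeds each speculator the token drafted by the previous one, an Attention Decoder that fuses that token with the LLM's hidden states, and an Augmenting Block that adapts those hidden states for speculation. For a concrete picture, here is a minimal PyTorch sketch of how such a speculator could be wired; the class name, the single-layer attention, greedy drafting, and all dimensions are assumptions for illustration, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class CloverSpeculatorSketch(nn.Module):
    """Minimal sketch of a Clover-style speculator (illustrative, not the
    paper's exact architecture). Each head conditions on the token drafted
    by the previous head (Regressive Connection), fuses it with the LLM's
    hidden states via attention (Attention Decoder), after an extra
    transformer layer has adapted those states (Augmenting Block)."""

    def __init__(self, hidden: int, vocab: int, n_heads: int = 3):
        super().__init__()
        # Augmenting Block: adapts hidden states from next-token prediction
        # to speculative generation (a single transformer layer here).
        self.augment = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.lm_heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(n_heads))

    def forward(self, hidden_states: torch.Tensor, last_token: torch.Tensor):
        # hidden_states: (batch, seq, hidden) from the target LLM's last layer
        # last_token:    (batch,) token accepted at the current decoding step
        h = self.augment(hidden_states)             # Augmenting Block
        prev = self.embed(last_token).unsqueeze(1)  # (batch, 1, hidden)
        drafts = []
        for head in self.lm_heads:
            # Attention Decoder: query = previously drafted token, keys and
            # values = augmented hidden states. Sequential knowledge flows
            # head-to-head through `prev` (the Regressive Connection).
            fused, _ = self.attn(prev, h, h)
            logits = head(fused.squeeze(1))         # (batch, vocab)
            next_tok = logits.argmax(dim=-1)        # greedy draft token
            drafts.append(next_tok)
            prev = self.embed(next_tok).unsqueeze(1)
        return torch.stack(drafts, dim=1)           # (batch, n_heads)

if __name__ == "__main__":
    spec = CloverSpeculatorSketch(hidden=768, vocab=32000)
    h = torch.randn(2, 16, 768)              # stand-in LLM hidden states
    print(spec(h, torch.tensor([5, 7])).shape)  # torch.Size([2, 3])
```

The draft tokens produced this way would then be verified against the target LLM in a single decoding step, as the abstract describes for parallel decoding in general.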