

VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

June 28, 2025
作者: Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee
cs.AI

Abstract

In this paper, we introduce a simple, training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM (the target model), which accepts a subset of the tokens as its valid generation. Since speculative decoding is usually considered to require a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even to share the LM head, as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently incurs unnecessary inference overhead during drafting, especially for target LLMs with very large vocabularies. We then propose a simple technique, VocabTrim, to mitigate the drafting overhead and improve generation speed in memory-bound environments. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected as those most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces drafting latency in memory-bound settings, which are common on edge devices, resulting in a higher memory-bound speed-up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
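To make the core idea concrete, the following is a minimal PyTorch sketch of the LM-head reconstruction the abstract describes: keep only the rows of the drafter's output projection that correspond to the most frequently sampled target-model tokens, and map drafted ids back to the full vocabulary for verification. The function and variable names (`trim_lm_head`, `token_counts`, `keep_k`) are illustrative assumptions, not identifiers from the paper, and the frequency statistics are assumed to have been collected offline from target-model generations.

```python
import torch
import torch.nn as nn

def trim_lm_head(lm_head: nn.Linear, token_counts: torch.Tensor, keep_k: int):
    """Rebuild an LM head restricted to the keep_k most frequently sampled tokens.

    lm_head:      original head mapping hidden states to full-vocabulary logits
    token_counts: per-token sampling frequencies over target-model outputs,
                  shape (vocab_size,)
    keep_k:       number of tokens retained in the trimmed drafter vocabulary
    """
    # Indices of the most frequently sampled tokens in the full vocabulary.
    kept_ids = torch.topk(token_counts, keep_k).indices

    # Slice the weight (and bias, if present) rows down to the kept tokens.
    trimmed = nn.Linear(lm_head.in_features, keep_k, bias=lm_head.bias is not None)
    with torch.no_grad():
        trimmed.weight.copy_(lm_head.weight[kept_ids])
        if lm_head.bias is not None:
            trimmed.bias.copy_(lm_head.bias[kept_ids])
    return trimmed, kept_ids

# Illustrative use during drafting (hidden = drafter backbone output):
#   logits = trimmed(hidden)                               # (batch, seq, keep_k)
#   draft_token = kept_ids[logits[..., -1, :].argmax(-1)]  # back to full-vocab ids
```

Because drafted ids are mapped back to full-vocabulary ids before verification, the target model is untouched; only the drafter's output projection shrinks, which is where the memory-bound drafting latency savings come from.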