VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs
June 28, 2025
Authors: Raghavv Goel, Sudhanshu Agrawal, Mukul Gagrani, Junyoung Park, Yifan Zao, He Zhang, Tian Liu, Yiping Yang, Xin Yuan, Jiuyan Lu, Chris Lott, Mingu Lee
cs.AI
Abstract
In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, the target model, which accepts a subset of the draft as its valid generation. Since it is usually considered that speculative decoding requires a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even to share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently incurs unnecessary inference overhead in drafting, especially for target LLMs with very large vocabularies. We then propose a simple technique, VocabTrim, to mitigate the drafting overhead and improve generation speed in memory-bound environments. VocabTrim reconstructs the drafter's LM head to contain only a limited set of tokens, selected as those most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces drafting latency in memory-bound settings, which is often the case on edge devices, resulting in a higher memory-bound speedup (MBSU). We show that our method can boost the memory-bound speedup for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
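
To make the LM-head reconstruction concrete, below is a minimal PyTorch sketch of the trimming step as the abstract describes it: keep only the rows of the drafter's LM head corresponding to the most frequently sampled target-vocabulary tokens, and retain the index mapping so drafted tokens remain valid target-model token ids. This is an illustrative sketch, not the authors' implementation; the function and variable names (trim_lm_head, token_counts, keep_k) are invented here, and the frequency statistics are assumed to be collected from target-model generations on some calibration corpus.

```python
# Minimal sketch of the VocabTrim idea (illustrative, not the paper's code).
import torch
import torch.nn as nn

def trim_lm_head(lm_head: nn.Linear, token_counts: torch.Tensor, keep_k: int):
    """Rebuild a drafter LM head over only the keep_k most frequent tokens.

    lm_head:      original head mapping hidden states -> full-vocab logits
    token_counts: per-token sampling frequencies over the target vocabulary,
                  assumed to come from target-model generations on
                  calibration data
    keep_k:       size of the trimmed draft vocabulary
    """
    # Token ids of the keep_k most frequently sampled tokens.
    keep_ids = torch.topk(token_counts, keep_k).indices

    # New, smaller head: same hidden size, reduced output vocabulary.
    trimmed = nn.Linear(lm_head.in_features, keep_k,
                        bias=lm_head.bias is not None)
    with torch.no_grad():
        trimmed.weight.copy_(lm_head.weight[keep_ids])  # slice rows of W
        if lm_head.bias is not None:
            trimmed.bias.copy_(lm_head.bias[keep_ids])

    # keep_ids maps trimmed logit positions back to target-vocabulary ids,
    # so drafted tokens can still be verified by the target model.
    return trimmed, keep_ids
```

Because the drafting step is memory-bound, shrinking the head's weight matrix from (vocab_size, hidden) to (keep_k, hidden) directly reduces the memory traffic per drafted token, which is the source of the speedup the abstract reports.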