VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Abstract

In this paper, we introduce a simple, training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) in the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, known as drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, which a base LLM, the target model, then verifies, accepting a subset as its valid generation. Because speculative decoding is usually thought to require a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even to share the LM head, as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently carries unnecessary inference overhead in drafting, especially for target LLMs with very large vocabularies. We then propose a simple technique, VocabTrim, that mitigates the drafting overhead and improves generation speed in memory-bound environments. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, namely those most frequently sampled from the vocabulary of the target model. Although limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces drafting latency in memory-bound settings, which are common on edge devices, resulting in a higher memory-bound speed-up (MBSU). We show that our method boosts the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
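As a rough illustration of the idea described above, the following minimal Python sketch prunes a toy LM head to the most frequently sampled tokens and maps the drafter's prediction back to a full-vocabulary token id. All names (`select_top_tokens`, `trim_lm_head`, `draft_argmax`), the toy weights, the calibration token list, and the use of greedy argmax are assumptions made for illustration; they are not taken from the paper, and the actual VocabTrim implementation may differ.

```python
from collections import Counter


def select_top_tokens(sampled_token_ids, k):
    """Return the ids of the k most frequently sampled tokens, in ascending order.

    Ties are broken by the smaller token id so the result is deterministic.
    """
    counts = Counter(sampled_token_ids)
    top = sorted(counts, key=lambda t: (-counts[t], t))[:k]
    return sorted(top)


def trim_lm_head(lm_head, kept_ids):
    """Keep only the LM-head rows (one row per vocabulary entry) for kept_ids."""
    return [lm_head[t] for t in kept_ids]


def draft_argmax(hidden, trimmed_head, kept_ids):
    """Greedy draft step: argmax over the trimmed logits, mapped to a full-vocab id."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in trimmed_head]
    best = max(range(len(logits)), key=logits.__getitem__)
    return kept_ids[best]


# Toy setup (hypothetical): vocabulary of 6 tokens, hidden size 3.
full_head = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]

# Hypothetical calibration data: token ids sampled from the target model.
calibration_tokens = [3, 1, 3, 5, 1, 3, 0, 5]

kept_ids = select_top_tokens(calibration_tokens, k=3)
trimmed_head = trim_lm_head(full_head, kept_ids)

# The drafter now computes 3 logits instead of 6 per drafted token.
hidden_state = [0.2, 0.9, 0.1]
draft_token = draft_argmax(hidden_state, trimmed_head, kept_ids)
```

In this sketch the drafter computes one logit per retained token rather than one per full-vocabulary entry, which is where the reduction in drafting cost would come from. Because the drafted id is mapped back through `kept_ids`, the target model can still verify it against its own full vocabulary. A token that falls outside the retained set can never be drafted, which is consistent with the slight drop in acceptance rate noted in the abstract.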
