POSS: Position Specialist Generates Better Draft for Speculative Decoding

Abstract

Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens and a large target model to verify those tokens in parallel. Recent studies leverage the hidden states of the target model to improve the draft model's prediction accuracy. However, existing methods suffer from degraded draft token predictions at later positions, caused by error accumulation in the features generated by the draft model. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers, each generating tokens at its assigned position(s). Position specialists greatly improve the token acceptance rate at later positions in each drafting round, because each specialist only needs to handle a certain level of draft model feature deviation. Experimental results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS improves over baselines in average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
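
The draft-then-verify loop with position-assigned drafters can be illustrated with a toy sketch. This is not the PosS implementation: the names `draft_round`, `verify`, `assignment`, and the toy models are hypothetical, the drafters here see only tokens (the real method also consumes target-model hidden states), and verification is written as a sequential loop for readability, whereas an actual target model checks all drafted tokens in one parallel forward pass.

```python
from typing import Callable, List, Sequence

Token = int
# A drafter (or the target) maps the context so far to one next token.
NextToken = Callable[[Sequence[Token]], Token]


def draft_round(context: Sequence[Token],
                specialists: Sequence[NextToken],
                assignment: Sequence[int]) -> List[Token]:
    """Draft len(assignment) tokens; draft position i is produced by
    specialists[assignment[i]] (hypothetical position-to-specialist map)."""
    drafted: List[Token] = []
    for spec_idx in assignment:
        drafted.append(specialists[spec_idx](list(context) + drafted))
    return drafted


def verify(context: Sequence[Token],
           drafted: Sequence[Token],
           target: NextToken) -> List[Token]:
    """Greedy verification: keep the longest drafted prefix the target agrees
    with, then append the target's own token (a correction at the first
    mismatch, or a bonus token if every drafted token was accepted)."""
    accepted: List[Token] = []
    for tok in drafted:
        expected = target(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target(list(context) + accepted))
    return accepted


# Toy models (assumptions for illustration only): the target counts upward
# modulo 10; specialist 1 makes a mistake whenever the last token is 3.
target: NextToken = lambda ctx: (ctx[-1] + 1) % 10
specialists: List[NextToken] = [
    lambda ctx: (ctx[-1] + 1) % 10,
    lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10,
]
assignment = [0, 0, 1, 1]  # specialist 0 drafts positions 0-1, specialist 1 drafts 2-3

drafted = draft_round([0], specialists, assignment)
accepted = verify([0], drafted, target)
print(drafted, accepted)
```

In this toy run, the first three drafted tokens are accepted and the fourth is replaced by the target's token, so the round yields four tokens for one verification step. The number of tokens gained per round is the quantity that "average acceptance length" measures.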
