Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding

Abstract

Speculative Decoding (SD) has become an important technique for accelerating inference in large language models. Conventional SD methods use a fixed draft length, which ignores how token generation difficulty varies across tasks. This paper addresses that issue and introduces SVIP, a difficulty-aware dynamic draft length policy for speculative decoding systems. Based on a theoretical lower bound on the draft token acceptance rate and an inference-time approximation of that bound, SVIP adaptively determines the length of each draft sequence from the entropy of each draft token distribution. Experiments on mainstream SD benchmarks and frameworks show that SVIP achieves up to 20% walltime speedup on SpecBench over baseline SD methods and 60% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is entirely training-free and compatible with any existing SD method that generates draft tokens autoregressively. Experiments also show that SVIP yields consistent walltime improvements on top of GliDe & CaPE and EAGLE-2.
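To make the idea of an entropy-driven draft length concrete, the following is a minimal sketch, not the SVIP criterion itself. It assumes a simple rule: keep drafting while the entropy of the draft model's next-token distribution stays below a fixed threshold. The abstract does not give the actual stopping condition derived from the acceptance-rate lower bound, so the threshold rule, the function names (`entropy`, `draft_until_uncertain`, `toy_next_dist`) and the parameters (`entropy_threshold`, `max_draft_len`) are all hypothetical.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_until_uncertain(
    next_dist: Callable[[List[int]], Sequence[float]],
    prefix: List[int],
    entropy_threshold: float,  # hypothetical knob, not a value from the paper
    max_draft_len: int,
) -> List[int]:
    """Greedily draft tokens, stopping once the draft distribution is too uncertain.

    Assumption: a plain entropy threshold stands in for the paper's
    stopping rule, which is not stated in the abstract.
    """
    draft: List[int] = []
    context = list(prefix)
    for _ in range(max_draft_len):
        probs = next_dist(context)
        if entropy(probs) > entropy_threshold:
            break  # the draft model is unsure: hand over to the target model
        token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        draft.append(token)
        context.append(token)
    return draft


def toy_next_dist(context: List[int]) -> Sequence[float]:
    """Toy draft model: confident for short contexts, uniform afterwards."""
    if len(context) < 4:
        return [0.97, 0.01, 0.01, 0.01]
    return [0.25, 0.25, 0.25, 0.25]


draft = draft_until_uncertain(
    toy_next_dist, prefix=[2], entropy_threshold=0.5, max_draft_len=8
)
print(draft)  # the draft ends as soon as the toy distribution turns uniform
```

In this toy setup the draft ends early on "hard" positions and runs longer on "easy" ones, which is the behaviour a fixed draft length cannot provide. A real system would read the distribution from the draft model's logits and pass the resulting draft to the target model for verification.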
