Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Abstract

Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as LLMs have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or perform inconsistently across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses the databases from the highest to the lowest locality to obtain draft tokens, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.
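The drafting step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The dictionary-based databases, the fixed-length key (`key_len`), and the helper names `draft`, `verify`, and `toy_next_token` are all assumptions introduced for illustration. Verification is written as a sequential greedy loop for readability, whereas a real system would check all draft tokens in one forward pass.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical layout: each database maps the last few context tokens
# to a candidate continuation. The abstract does not specify this.
Key = Tuple[int, ...]
Database = Dict[Key, List[int]]


def draft(context: Sequence[int], databases: Sequence[Database],
          key_len: int = 2) -> List[int]:
    """Query databases in order (highest locality first); return the first hit."""
    key = tuple(context[-key_len:])
    for db in databases:
        continuation = db.get(key)
        if continuation:
            return list(continuation)
    return []  # no draft found: decoding falls back to one token per step


def verify(context: Sequence[int], draft_tokens: Sequence[int],
           next_token: Callable[[Sequence[int]], int]) -> List[int]:
    """Greedy verification: accept draft tokens while they match the target model.

    The result always equals what the target model would emit on its own,
    which is what keeps the scheme lossless under greedy decoding.
    """
    accepted: List[int] = []
    for tok in draft_tokens:
        expected = next_token(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft token matched; the target model contributes one more token.
    accepted.append(next_token(list(context) + accepted))
    return accepted


def toy_next_token(ctx: Sequence[int]) -> int:
    """Stand-in for the target LLM (assumption): counts upward modulo 10."""
    return (ctx[-1] + 1) % 10
```

With a high-locality database `{(2, 3): [4, 5, 9]}` and the context `[1, 2, 3]`, `draft` returns `[4, 5, 9]` without consulting lower-locality databases, and `verify` accepts `4` and `5` and then replaces the mismatched `9` with the toy model's own token. Querying the most local source first means the cheapest and most likely match is tried before any broader lookup.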

