OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Abstract

Speculative decoding generally requires a small, efficient draft model that is either pretrained or distilled offline for a particular target model series, such as Llama or Qwen. In online deployment settings, however, two major challenges arise: 1) the target model may be incompatible with the draft model; 2) latency is expected to improve with usage over time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the vocabulary mismatch between draft and target models, and we further improve decoding speed with adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications, where model cost, efficiency, and user customization are the primary concerns. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We demonstrate the effectiveness of the OmniDraft framework through online learning on math reasoning, coding, and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models, including Vicuna-7B, Qwen2-7B, and Llama3-8B, for speculative decoding, and additionally provides up to 1.5-2x speedup.
