
OmniDraft:一款跨词汇、在线自适应的设备端推测解码加速器

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

July 3, 2025
Authors: Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
cs.AI

Abstract

Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.
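To make the cross-vocabulary idea in the abstract concrete, here is a minimal, hypothetical sketch of how an online n-gram cache could translate draft-model tokens into target-model tokens before the usual speculative accept/reject step. The vocabularies, cache entries, and the `target_accepts` callback below are all illustrative assumptions, not the paper's actual implementation (which uses Llama-68M drafting for 7B–8B targets and learns the cache online).

```python
# Toy vocabularies for a hypothetical draft/target tokenizer pair.
DRAFT_VOCAB = ["he", "llo", "hello", " world", "!"]
TARGET_VOCAB = ["hel", "lo", " wor", "ld", "!", "hello", " world"]

# Online n-gram cache: maps a sequence of draft tokens to one target token
# whose surface string matches. In OmniDraft this mapping is built on the
# fly from user data; here it is hard-coded for illustration.
ngram_cache = {
    ("he", "llo"): "hello",
    ("hello",): "hello",
    (" world",): " world",
    ("!",): "!",
}

def translate_draft_tokens(draft_tokens):
    """Greedily map draft tokens to target tokens, longest n-gram first."""
    out, i = [], 0
    while i < len(draft_tokens):
        for n in range(len(draft_tokens) - i, 0, -1):
            key = tuple(draft_tokens[i:i + n])
            if key in ngram_cache:
                out.append(ngram_cache[key])
                i += n
                break
        else:
            # No mapping known yet for this span: the system would fall
            # back to target-only decoding and add a cache entry later.
            return None
    return out

def speculative_step(draft_tokens, target_accepts):
    """Verify translated draft tokens against the target model.

    `target_accepts(tok)` stands in for the target model's probabilistic
    accept/reject check in standard speculative decoding.
    """
    translated = translate_draft_tokens(draft_tokens)
    if translated is None:
        return []
    accepted = []
    for tok in translated:
        if not target_accepts(tok):
            break  # first rejection ends the speculative run
        accepted.append(tok)
    return accepted
```

As a usage sketch, `speculative_step(["he", "llo", " world"], lambda t: t in TARGET_VOCAB)` would translate the three draft tokens into the two target tokens `["hello", " world"]` and accept both, so the target model advances two tokens for one verification pass; the speedup the paper reports comes from such multi-token acceptances.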