
OmniDraft:一款跨词汇、在线自适应的设备端推测解码加速器

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

July 3, 2025
Authors: Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
cs.AI

Abstract

Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.
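To make the cross-vocabulary idea in the abstract concrete, here is a minimal, hypothetical sketch of how an online n-gram cache could translate draft-model tokens into target-model tokens before the usual speculative accept/reject step. The vocabularies, cache entries, and the `target_accepts` callback below are all illustrative assumptions, not the paper's actual implementation (which uses Llama-68M drafting for 7B–8B targets and learns the cache online).

```python
# Toy vocabularies for a hypothetical draft/target tokenizer pair.
DRAFT_VOCAB = ["he", "llo", "hello", " world", "!"]
TARGET_VOCAB = ["hel", "lo", " wor", "ld", "!", "hello", " world"]

# Online n-gram cache: maps a sequence of draft tokens to one target token
# whose surface string matches. In OmniDraft this mapping is built on the
# fly from user data; here it is hard-coded for illustration.
ngram_cache = {
    ("he", "llo"): "hello",
    ("hello",): "hello",
    (" world",): " world",
    ("!",): "!",
}

def translate_draft_tokens(draft_tokens):
    """Greedily map draft tokens to target tokens, longest n-gram first."""
    out, i = [], 0
    while i < len(draft_tokens):
        for n in range(len(draft_tokens) - i, 0, -1):
            key = tuple(draft_tokens[i:i + n])
            if key in ngram_cache:
                out.append(ngram_cache[key])
                i += n
                break
        else:
            # No mapping known yet for this span: the system would fall
            # back to target-only decoding and add a cache entry later.
            return None
    return out

def speculative_step(draft_tokens, target_accepts):
    """Verify translated draft tokens against the target model.

    `target_accepts(tok)` stands in for the target model's probabilistic
    accept/reject check in standard speculative decoding.
    """
    translated = translate_draft_tokens(draft_tokens)
    if translated is None:
        return []
    accepted = []
    for tok in translated:
        if not target_accepts(tok):
            break  # first rejection ends the speculative run
        accepted.append(tok)
    return accepted
```

As a usage sketch, `speculative_step(["he", "llo", " world"], lambda t: t in TARGET_VOCAB)` would translate the three draft tokens into the two target tokens `["hello", " world"]` and accept both, so the target model advances two tokens for one verification pass; the speedup the paper reports comes from such multi-token acceptances.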