

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

July 3, 2025
作者: Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
cs.AI

Abstract

Speculative decoding generally requires a small, efficient draft model that is either pretrained or distilled offline for a particular target model series, for instance the Llama or Qwen models. In online deployment settings, however, there are two major challenges: 1) use of a target model that is incompatible with the draft model; and 2) the expectation that latency will improve with continued use over time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch between draft and target models, and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications, where model cost, efficiency, and user customization are the main points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding, and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models, including Vicuna-7B, Qwen2-7B, and Llama3-8B, for speculative decoding, and additionally provides up to a 1.5-2x speedup.
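The core mechanism the abstract builds on, speculative decoding, can be illustrated with a toy sketch: a cheap draft model proposes several tokens, and the target model verifies them in one pass, keeping the longest agreeing prefix plus one token of its own. This is a minimal greedy-verification sketch, not the paper's implementation; the callables and toy token sequences below are illustrative assumptions standing in for real model calls.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One speculative decoding step (greedy verification).

    draft_next / target_next: callables mapping a token list to the
    next token under the draft / target model (hypothetical stand-ins
    for real model forward passes). Returns the tokens emitted this
    step: the longest draft-proposed prefix the target agrees with,
    plus one token supplied by the target itself.
    """
    # 1) The draft proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target verifies all k proposals (in practice, a single
    #    batched forward pass), accepting the longest matching prefix.
    out, ctx = [], list(prefix)
    for tok in proposed:
        if target_next(ctx) != tok:
            break
        out.append(tok)
        ctx.append(tok)

    # 3) The target always contributes one token, so each step emits
    #    1 to k+1 tokens while exactly matching greedy target decoding.
    out.append(target_next(ctx))
    return out


# Toy demo: the draft agrees with the target on the first two tokens.
draft_seq = ["a", "b", "x", "y", "z"]
target_seq = ["a", "b", "c", "d", "e"]
step = speculative_step(lambda ctx: draft_seq[len(ctx)],
                        lambda ctx: target_seq[len(ctx)],
                        prefix=[], k=4)
print(step)  # three tokens emitted from one round of target verification
```

Because each target verification pass can emit several tokens at once, a well-aligned draft model reduces the number of expensive target passes, which is the source of the 1.5-2x speedups reported above.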