

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

July 3, 2025
作者: Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
cs.AI

Abstract

Speculative decoding generally requires a small, efficient draft model that is either pretrained or distilled offline for a particular target model series, for instance the Llama or Qwen models. In online deployment settings, however, there are two major challenges: 1) use of a target model that is incompatible with the draft model; and 2) the expectation that latency will improve with continued use over time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch between draft and target models, and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications, where model cost, efficiency, and user customization are the main points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding, and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models, including Vicuna-7B, Qwen2-7B, and Llama3-8B, for speculative decoding, and additionally provides up to a 1.5-2x speedup.
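The core mechanism the abstract builds on, speculative decoding, can be illustrated with a toy sketch: a cheap draft model proposes several tokens, and the target model verifies them in one pass, keeping the longest agreeing prefix plus one token of its own. This is a minimal greedy-verification sketch, not the paper's implementation; the callables and toy token sequences below are illustrative assumptions standing in for real model calls.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One speculative decoding step (greedy verification).

    draft_next / target_next: callables mapping a token list to the
    next token under the draft / target model (hypothetical stand-ins
    for real model forward passes). Returns the tokens emitted this
    step: the longest draft-proposed prefix the target agrees with,
    plus one token supplied by the target itself.
    """
    # 1) The draft proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target verifies all k proposals (in practice, a single
    #    batched forward pass), accepting the longest matching prefix.
    out, ctx = [], list(prefix)
    for tok in proposed:
        if target_next(ctx) != tok:
            break
        out.append(tok)
        ctx.append(tok)

    # 3) The target always contributes one token, so each step emits
    #    1 to k+1 tokens while exactly matching greedy target decoding.
    out.append(target_next(ctx))
    return out


# Toy demo: the draft agrees with the target on the first two tokens.
draft_seq = ["a", "b", "x", "y", "z"]
target_seq = ["a", "b", "c", "d", "e"]
step = speculative_step(lambda ctx: draft_seq[len(ctx)],
                        lambda ctx: target_seq[len(ctx)],
                        prefix=[], k=4)
print(step)  # three tokens emitted from one round of target verification
```

Because each target verification pass can emit several tokens at once, a well-aligned draft model reduces the number of expensive target passes, which is the source of the 1.5-2x speedups reported above.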