OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Abstract

Speculative decoding generally requires a small, efficient draft model that is either pretrained or distilled offline for a particular target model series, for instance, Llama or Qwen models. In online deployment settings, however, two major challenges arise: 1) the target model in use may be incompatible with the draft model; 2) latency is expected to improve with usage and over time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch between draft and target models, and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications, where model cost, efficiency, and user customization are the major concerns. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding, and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models, including Vicuna-7B, Qwen2-7B, and Llama3-8B, for speculative decoding, and additionally provides up to 1.5-2x speedup.
