Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
June 24, 2024
Authors: Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun
cs.AI
Abstract
Large language models (LLMs) have revolutionized natural language processing
and broadened their applicability across diverse commercial applications.
However, the deployment of these models is constrained by high inference time
in multilingual settings. To mitigate this challenge, this paper explores a
training recipe for an assistant model in speculative decoding, which is
leveraged to draft tokens that are then verified by the target LLM. We show
that language-specific draft models, optimized through a targeted
pretrain-and-finetune strategy, substantially speed up inference compared to
previous methods. We validate these models across various languages on
inference time, out-of-domain speedup, and GPT-4o evaluation.
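The draft-then-verify loop the abstract refers to can be illustrated with a minimal greedy sketch. This is not the paper's implementation: `speculative_decode`, the toy `target`/`draft` callables, and the fixed proposal length `k` are all hypothetical simplifications, assuming greedy (argmax) decoding on both models rather than the probabilistic acceptance rule used in full speculative sampling.

```python
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       draft: Callable[[List[int]], int],
                       prompt: List[int],
                       max_new: int,
                       k: int = 4) -> List[int]:
    """Greedy speculative decoding sketch (hypothetical, not the paper's code).

    The cheap draft model proposes k tokens autoregressively; the target
    model verifies each one, keeping the agreed prefix and substituting its
    own token at the first disagreement.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft phase: propose k tokens with the assistant model.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify phase: the target checks each proposed token; on the first
        # mismatch it emits its own token and the rest of the draft is dropped.
        for t in proposal:
            want = target(seq)
            if t == want:
                seq.append(t)
            else:
                seq.append(want)
                break
            if len(seq) - len(prompt) >= max_new:
                break
    return seq[len(prompt):]
```

When the draft model agrees with the target (as a well-matched language-specific drafter should), each verification pass accepts up to `k` tokens at once; when it disagrees, progress degrades gracefully to one target token per pass.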