

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

June 24, 2024
作者: Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun
cs.AI

Abstract
Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in multilingual settings. To mitigate this challenge, this paper explores a training recipe for an assistant model in speculative decoding, which is leveraged to draft tokens that are then verified by the target LLM. We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, substantially speed up inference compared to previous methods. We validate these models across various languages in terms of inference time, out-of-domain speedup, and GPT-4o evaluation.
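The draft-then-verify loop at the heart of speculative decoding can be sketched as follows. This is a minimal illustration, not the paper's method: `draft_model` and `target_model` are hypothetical toy next-token functions standing in for a small multilingual drafter and a large target LLM, and verification here is simple greedy agreement rather than the probabilistic acceptance rule used with sampling.

```python
def draft_model(context):
    # Toy drafter: predicts last token + 1 (stand-in for a small LM's argmax).
    return context[-1] + 1

def target_model(context):
    # Toy target: mostly agrees with the drafter, but maps 4 -> 10,
    # modeling the occasional disagreement between draft and target.
    nxt = context[-1] + 1
    return 10 if nxt == 4 else nxt

def speculative_decode(context, k=4):
    """Draft k tokens autoregressively, then verify them with the target.

    The longest prefix on which both models agree is accepted; the first
    disagreement is replaced by the target's own token, so the output
    matches greedy decoding with the target model alone.
    """
    # 1) Draft phase: the small model proposes k tokens cheaply.
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify phase: the target checks each drafted token in order
    #    (in a real system this is a single batched forward pass).
    accepted = list(context)
    for tok in drafted:
        expected = target_model(accepted)
        if tok == expected:
            accepted.append(tok)       # agreement: accept the drafted token
        else:
            accepted.append(expected)  # mismatch: take the target's token, stop
            break
    return accepted

print(speculative_decode([0, 1], k=4))  # -> [0, 1, 2, 3, 10]
```

The speedup comes from the verify phase: the expensive target model scores all k drafted tokens in one pass instead of k sequential passes, and a language-specific drafter raises the acceptance rate, so more drafted tokens survive verification per call to the target.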

