Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Abstract

Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in multilingual settings. To mitigate this challenge, this paper explores a training recipe for an assistant model in speculative decoding, in which the assistant model drafts future tokens that are then verified by the target LLM. We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, substantially reduce inference time compared to previous methods. We validate these models across various languages in terms of inference time, out-of-domain speedup, and GPT-4o evaluation.
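
The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. It assumes greedy decoding and toy next-token functions in place of real models. The names `speculative_decode`, `greedy_decode`, `target`, `draft`, and the draft length `k` are hypothetical and not taken from the paper, and the paper's training recipe for the drafter is not reproduced here.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-then-verify; returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < goal:
        # 1) Draft: the small assistant model proposes up to k future tokens.
        n = min(k, goal - len(tokens))
        proposal: List[int] = []
        for _ in range(n):
            proposal.append(draft(tokens + proposal))
        # 2) Verify: the target checks every drafted position. A real system
        #    would score all positions in one batched forward pass; here that
        #    pass is simulated with per-position calls and counted once.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)        # draft agrees with the target: keep it
            else:
                accepted.append(expected)   # first mismatch: use the target's token
                break
        tokens.extend(accepted)
    return tokens, passes


# Toy models (illustrative only): the drafter agrees with the target
# except when the last token is divisible by 3.
def target(prefix: List[int]) -> int:
    return (prefix[-1] * 2 + 1) % 7


def draft(prefix: List[int]) -> int:
    guess = (prefix[-1] * 2 + 1) % 7
    return guess if prefix[-1] % 3 else (guess + 1) % 7


output, passes = speculative_decode(target, draft, prompt=[1], max_new_tokens=12, k=4)
baseline = greedy_decode(target, prompt=[1], max_new_tokens=12)
```

In this greedy setting the output matches what the target model would produce alone, because every kept token is one the target itself would have chosen. The saving comes from needing fewer verification passes than generated tokens, which is why a drafter that agrees with the target more often in a given language should yield a larger speedup.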

