EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Abstract

Auto-regressive decoding makes inference with Large Language Models (LLMs) time-consuming. We propose EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple framework for lossless acceleration. Unlike traditional speculative sampling methods, EAGLE runs the drafting process auto-regressively at the more regular (second-top-layer) feature level, and it addresses the sampling uncertainty in next-feature prediction by integrating tokens from one time step ahead. The acceleration provided by EAGLE is lossless: it requires no fine-tuning of the target LLM, and the generated text follows the same distribution as vanilla auto-regressive decoding. As of the submission of this paper, EAGLE is the fastest known framework within the speculative sampling family. On MT-bench, EAGLE is 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x faster than Medusa. Using gpt-fast, EAGLE attains on average 160 tokens/s with LLaMA2-Chat 13B on a single RTX 3090 GPU, compared with 24 tokens/s for Huggingface's implementation.
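To make the "lossless" claim concrete, the following is a minimal sketch of the standard accept/reject rule used in speculative sampling in general. It is not EAGLE's feature-level drafting, which the abstract does not specify in detail. The function names, the three-token vocabulary, and the probability vectors are all assumptions chosen for illustration. A draft distribution q proposes a token, the target distribution p accepts it with probability min(1, p(x)/q(x)), and on rejection a token is resampled from the renormalized residual max(0, p - q). Under these assumptions the returned token follows p exactly, which is the sense in which the output distribution matches vanilla decoding.

```python
import random


def speculative_step(p, q, rng):
    """One draft-then-verify step over a small vocabulary.

    p: target-model probabilities, q: draft-model probabilities
    (hypothetical lists of equal length, each summing to 1).
    Returns a token index distributed according to p.
    """
    vocab = range(len(p))
    # The draft model proposes a token by sampling from q.
    x = rng.choices(vocab, weights=q, k=1)[0]
    # The target model accepts it with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0]


def empirical_distribution(p, q, n=200_000, seed=0):
    """Estimate the output distribution of speculative_step by simulation."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        counts[speculative_step(p, q, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # illustrative target distribution (assumption)
    q = [0.2, 0.5, 0.3]  # illustrative draft distribution (assumption)
    # The printed frequencies should be close to p, not q.
    print(empirical_distribution(p, q))
```

In this sketch the draft distribution only affects how often proposals are accepted, not what is ultimately sampled. A draft closer to the target is accepted more often, which is where the speedup in speculative methods comes from.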
