EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
January 26, 2024
Authors: Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
cs.AI
Abstract
Auto-regressive decoding makes the inference of Large Language Models (LLMs)
time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm
for Greater Language-model Efficiency), for lossless acceleration. Unlike
traditional speculative sampling methods, EAGLE operates the drafting process
auto-regressively at the more regular (second-top-layer) feature level and
addresses the sampling uncertainty issues in the next-feature prediction
problems by integrating tokens from one time step ahead. The acceleration
provided by EAGLE is lossless: it involves no fine-tuning of the target LLM,
and the generated text maintains the same distribution as that of vanilla
auto-regressive decoding. As of the submission of this paper, EAGLE is the
fastest known framework within the speculative sampling family. On MT-bench,
EAGLE is 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x
faster than Medusa. Using gpt-fast, EAGLE attains on average 160 tokens/s with
LLaMA2-Chat 13B on a single RTX 3090 GPU, compared with 24 tokens/s for
Huggingface's implementation.
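The abstract's "lossless acceleration" claim rests on the generic draft-then-verify pattern of speculative sampling: a cheap drafter proposes several tokens, the target model verifies them in one pass, and any rejected token is replaced by the target's own choice, so the output distribution matches vanilla decoding. The sketch below illustrates that pattern (greedy variant) with toy stand-in functions; `draft_model` and `target_model` are hypothetical placeholders and do not model EAGLE's actual feature-level (second-top-layer) drafting head.

```python
# Illustrative draft-then-verify loop for speculative decoding (greedy case).
# The toy models below are assumptions for demonstration, not EAGLE's method.

def draft_model(prefix, k):
    # Hypothetical cheap drafter: guesses the sequence continues by
    # incrementing the last token k times.
    out, last = [], prefix[-1]
    for _ in range(k):
        last += 1
        out.append(last)
    return out

def target_model(prefix):
    # Hypothetical target "LLM" (greedy): increments, but resets to 0 after
    # a nonzero multiple of 5, so some draft tokens get rejected.
    last = prefix[-1]
    return 0 if last % 5 == 0 and last != 0 else last + 1

def speculative_decode(prompt, num_tokens, k=4):
    seq = list(prompt)
    while len(seq) - len(prompt) < num_tokens:
        draft = draft_model(seq, k)
        accepted, ctx = [], list(seq)
        for tok in draft:
            t = target_model(ctx)
            if t == tok:              # target agrees: accept draft token
                accepted.append(tok)
                ctx.append(tok)
            else:                     # mismatch: target's token replaces it
                accepted.append(t)
                break
        else:
            # All k draft tokens accepted: take one bonus target token.
            accepted.append(target_model(ctx))
        seq.extend(accepted)
    return seq[:len(prompt) + num_tokens]
```

Because every emitted token is either verified or produced by the target model, the output is identical to plain auto-regressive decoding with `target_model`; the speedup comes from verifying k draft tokens in what would be a single target forward pass.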