EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Abstract

Auto-regressive decoding makes inference with Large Language Models (LLMs) time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), for lossless acceleration. Unlike traditional speculative sampling methods, EAGLE runs the drafting process auto-regressively at the more regular (second-top-layer) feature level, and it addresses the sampling uncertainty in next-feature prediction by integrating tokens from one time step ahead. The acceleration provided by EAGLE is lossless: it involves no fine-tuning of the target LLM, and the generated text keeps the same distribution as that of vanilla auto-regressive decoding. As of the submission of this paper, EAGLE is the fastest known framework within the speculative sampling family. On MT-bench, EAGLE is 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x faster than Medusa. Using gpt-fast, EAGLE attains on average 160 tokens/s with LLaMA2-Chat 13B on a single RTX 3090 GPU, compared with 24 tokens/s for Huggingface's implementations.
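The claim that the output distribution is unchanged rests on the verification step shared by speculative sampling methods in general. The sketch below is a minimal illustration of that standard accept/reject rule for a single drafted token, not EAGLE's feature-level drafting or tree-structured verification. The function name `verify_draft_token` and its arguments are hypothetical, and the two distributions are assumed to be plain probability lists over the same vocabulary.

```python
import random


def verify_draft_token(p_target, q_draft, draft_token, rng=random):
    """Accept or replace one drafted token so the result follows p_target.

    p_target, q_draft: probability lists over the same vocabulary
    (target model and draft model; both are assumed inputs here).
    draft_token: index that was sampled from q_draft.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the draft with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized
    # implicitly by passing the residual as sampling weights.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case (the two distributions coincide): sample from p_target.
        residual = list(p_target)
    token = rng.choices(range(len(residual)), weights=residual)[0]
    return token, False
```

Under this rule a drafted token is always kept when the target model assigns it at least as much probability as the draft model did, and rejected tokens are replaced in a way that exactly compensates for the draft model's bias. A better draft model therefore raises the acceptance rate, and with it the speedup, without altering what the target model would have generated.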
