EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
March 3, 2025
Authors: Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
cs.AI
Abstract
The sequential nature of modern LLMs makes them expensive and slow, and
speculative sampling has proven to be an effective solution to this problem.
Methods like EAGLE perform autoregression at the feature level, reusing
top-layer features from the target model to achieve better results than vanilla
speculative sampling. A growing trend in the LLM community is scaling up
training data to improve model intelligence without increasing inference costs.
However, we observe that scaling up data provides limited improvements for
EAGLE. We identify that this limitation arises from EAGLE's feature prediction
constraints. In this paper, we introduce EAGLE-3, which abandons feature
prediction in favor of direct token prediction and replaces reliance on
top-layer features with multi-layer feature fusion via a technique named
training-time test. These improvements significantly enhance performance and
enable the draft model to fully benefit from scaling up training data. Our
experiments include both chat models and reasoning models, evaluated on five
tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with
about 1.4x improvement over EAGLE-2. The code is available at
https://github.com/SafeAILab/EAGLE.
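To make the draft-then-verify idea behind speculative sampling concrete, here is a minimal sketch of the greedy variant: a cheap draft model proposes `k` tokens, and the target model accepts the longest prefix it agrees with, substituting its own token at the first disagreement. This is a toy illustration, not EAGLE-3 itself (which drafts from fused multi-layer features of the target); the `target_next`/`draft_next` callables are hypothetical stand-ins for real model forward passes.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=16):
    """Greedy speculative decoding sketch (not the EAGLE-3 architecture).

    target_next / draft_next: callables mapping a token sequence to the next
    token -- hypothetical stand-ins for expensive and cheap model calls.
    Each round, the draft proposes k tokens; the target verifies them,
    keeping the longest matching prefix plus one corrected token.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # Draft phase: propose k tokens autoregressively with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Verify phase: the target checks each proposed token in order.
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                # Reject: keep the accepted prefix and the target's own token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(draft)  # all k proposals accepted
    return tokens[: len(prompt) + max_new]
```

By construction the output is identical to decoding greedily with the target model alone; the speedup comes from verifying `k` draft tokens in one (batched) target pass instead of `k` sequential ones, and the achievable ratio grows with the draft model's acceptance rate, which is what EAGLE-3's token prediction and multi-layer feature fusion aim to raise.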