EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Abstract

The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio of up to 6.5x, with about 1.4x improvement over EAGLE-2. The code is available at https://github.com/SafeAILab/EAGLE.
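
The abstract relies on speculative sampling without restating how it works. As a reminder, the following is a minimal sketch of the generic accept/reject verification step used in standard speculative sampling. It is not taken from the EAGLE repository and does not show EAGLE-3's drafting or feature fusion; the function name `speculative_accept` and its arguments are hypothetical, chosen only for illustration, and the probability vectors are assumed to be plain Python lists over a shared vocabulary.

```python
import random


def speculative_accept(p, q, draft_token, rng):
    """One verification step of standard speculative sampling (illustrative).

    p: target-model probabilities over the vocabulary (list of floats)
    q: draft-model probabilities over the same vocabulary
    draft_token: index that the draft model sampled from q
    rng: a random.Random instance
    Returns (token, accepted).
    """
    # Accept the drafted token with probability min(1, p / q).
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True

    # On rejection, resample from the residual distribution max(p - q, 0).
    # random.choices accepts unnormalized weights, so no division is needed.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return token, False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # assumed target distribution
    q = [0.2, 0.5, 0.3]  # assumed draft distribution
    draft = rng.choices(range(len(q)), weights=q, k=1)[0]
    print(speculative_accept(p, q, draft, rng))
```

Resampling from the residual on rejection is what makes the emitted token follow the target distribution `p` exactly, so a better draft model raises the acceptance rate and the speedup without changing the output distribution.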
