EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Abstract

The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio of up to 6.5x, about a 1.4x improvement over EAGLE-2. The code is available at https://github.com/SafeAILab/EAGLE.
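For readers unfamiliar with the speculative sampling that the abstract builds on, the following is a minimal sketch of the standard accept/reject rule from vanilla speculative sampling. It is not EAGLE-3's draft model or training procedure. The function names (`verify_draft_token`, `speculative_step`) and the toy three-token distributions are assumptions made for illustration. A cheap draft distribution `q_draft` proposes a token, and the target distribution `p_target` accepts it with probability min(1, p/q) or resamples from the normalized residual max(p - q, 0), so the output still follows the target distribution.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Apply the vanilla speculative sampling rule to one drafted token.

    Returns (accepted, token). Hypothetical helper for illustration only.
    """
    p = p_target[token]
    q = q_draft[token]
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return True, token
    # On rejection, resample from the residual distribution
    # proportional to max(p - q, 0) over the vocabulary.
    residual = [max(pt - qd, 0.0) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    if total <= 0.0:
        # Degenerate case: the two distributions coincide.
        return True, token
    new_token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, new_token


def speculative_step(p_target, q_draft, rng):
    """Draft one token from q_draft, then verify it against p_target."""
    token = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
    return verify_draft_token(token, p_target, q_draft, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    p_target = [0.6, 0.3, 0.1]  # toy target-model distribution (assumed)
    q_draft = [0.3, 0.5, 0.2]   # toy draft-model distribution (assumed)
    n = 100_000
    counts = [0, 0, 0]
    for _ in range(n):
        _, tok = speculative_step(p_target, q_draft, rng)
        counts[tok] += 1
    # Empirical frequencies should be close to p_target.
    print([round(c / n, 3) for c in counts])
```

The better the draft distribution matches the target, the more often drafted tokens are accepted, which is why improving the draft model translates into a higher speedup ratio.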
