FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

Abstract

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation of large language models (LLMs): a draft-then-verify mechanism produces multiple tokens per forward pass. State-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model, achieving impressive layer compression, but their efficiency gains are substantially reduced for large-vocabulary LLMs such as Llama-3-8B, whose vocabulary contains 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2.
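The idea of restricting the draft search to a frequency-prioritized subset can be illustrated with a toy sketch. This is not the FR-Spec implementation: the function names (`frequency_ranked_subset`, `draft_logits`, `draft_argmax`), the use of a sample corpus to estimate token frequencies, and the pure-Python list-of-rows LM head are all assumptions made for illustration. If LM head cost is assumed to scale linearly with the number of vocabulary rows, a 75% reduction would correspond to keeping roughly a quarter of the 128k entries. The sketch covers only the draft side; verification by the target model is not modeled here.

```python
from collections import Counter


def frequency_ranked_subset(corpus_token_ids, keep):
    """Return the `keep` most frequent token ids, sorted ascending.

    Hypothetical helper: frequencies are estimated from a sample corpus.
    """
    counts = Counter(corpus_token_ids)
    top = [tok for tok, _ in counts.most_common(keep)]
    return sorted(top)


def draft_logits(hidden, lm_head, subset):
    """Compute logits only for the LM head rows listed in `subset`.

    `lm_head` is a list of weight rows, one per vocabulary entry
    (an illustrative stand-in for a real weight matrix).
    """
    return [sum(w * h for w, h in zip(lm_head[tok], hidden)) for tok in subset]


def draft_argmax(hidden, lm_head, subset):
    """Pick the best draft token within the subset, as a full-vocabulary id."""
    logits = draft_logits(hidden, lm_head, subset)
    best = max(range(len(subset)), key=logits.__getitem__)
    return subset[best]  # map the subset index back to the original token id


# Toy setup: vocabulary of 8 tokens, keep 2 of them (a quarter of the rows).
corpus = [3, 3, 3, 5, 5, 1, 3, 5, 7]
lm_head = [[float(i), 1.0] for i in range(8)]
hidden = [1.0, 2.0]

subset = frequency_ranked_subset(corpus, keep=2)
print(subset, draft_argmax(hidden, lm_head, subset))
```

In this toy setup the draft step evaluates 2 rows instead of 8, and the returned id is always expressed in the full vocabulary so that a verifier operating on the complete vocabulary could check it.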

