
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

February 20, 2025
Authors: Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2.
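The core idea of the abstract — restricting the draft model's LM head to a frequency-ranked subset of the vocabulary — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`build_freq_subset`, `draft_logits`) and the NumPy setting are assumptions, and a real system would slice the LM head weights of the actual draft model (e.g. an EAGLE-style single-layer drafter) and map subset-local argmax results back to global token ids before verification.

```python
import numpy as np

def build_freq_subset(token_counts, keep_ratio=0.25):
    """Rank token ids by corpus frequency and keep the top fraction.

    token_counts: (V,) array of per-token occurrence counts.
    Returns the ids of the most frequent tokens.
    """
    ranked = np.argsort(token_counts)[::-1]          # most frequent first
    k = max(1, int(len(token_counts) * keep_ratio))
    return ranked[:k]

def draft_logits(hidden, lm_head_weight, subset_ids):
    """Compute draft logits only over the reduced vocabulary.

    The matmul cost drops from O(V*d) to O(k*d); with keep_ratio=0.25
    this matches the ~75% LM-head reduction described in the abstract.
    Verification still uses the full model, so the output distribution
    is unchanged.
    """
    sub_w = lm_head_weight[subset_ids]               # (k, d) slice of (V, d) head
    return hidden @ sub_w.T                          # (..., k) scores over subset
```

Drafting over the subset only prunes which candidate tokens the drafter can propose; rejected or out-of-subset tokens are always recovered by the full-vocabulary verification pass, which is why the final distribution stays equivalent.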

