Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
February 8, 2025
Authors: Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon
cs.AI
Abstract
Accelerating inference in Large Language Models (LLMs) is critical for
real-time interactions, as they have been widely incorporated into real-world
services. Speculative decoding, a fully algorithmic solution, has gained
attention for improving inference speed by drafting and verifying tokens,
thereby generating multiple tokens in a single forward pass. However, current
drafting strategies usually require significant fine-tuning or have
inconsistent performance across tasks. To address these challenges, we propose
Hierarchy Drafting (HD), a novel lossless drafting approach that organizes
various token sources into multiple databases in a hierarchical framework based
on temporal locality. In the drafting step, HD sequentially accesses multiple
databases to obtain draft tokens from the highest to the lowest locality,
ensuring consistent acceleration across diverse tasks and minimizing drafting
latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters
demonstrate that HD outperforms existing database drafting methods, achieving
robust inference speedups across model sizes, tasks, and temperatures.
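The drafting step described above — querying hierarchically ordered token databases from highest to lowest temporal locality and returning the first match — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the three n-gram databases (current context, recent history, static corpus), the prefix length, and the draft length are all assumptions chosen for clarity.

```python
def build_ngram_db(token_ids, prefix_len=2, draft_len=3):
    """Map each `prefix_len`-token prefix to the tokens that followed it.

    Stands in for one HD database; a later occurrence of a prefix
    overwrites an earlier one, keeping the most recent continuation.
    """
    db = {}
    for i in range(len(token_ids) - prefix_len - draft_len + 1):
        prefix = tuple(token_ids[i:i + prefix_len])
        db[prefix] = token_ids[i + prefix_len:i + prefix_len + draft_len]
    return db


def hierarchy_draft(context, databases, prefix_len=2):
    """Query databases from highest to lowest locality; return the first hit."""
    prefix = tuple(context[-prefix_len:])
    for db in databases:  # ordered: current context -> history -> corpus
        draft = db.get(prefix)
        if draft is not None:
            return draft
    return []  # no draft found; fall back to plain autoregressive decoding


# Toy example with integer token ids standing in for subword tokens.
context_tokens = [5, 6, 7, 8, 5, 6]      # highest locality
history_tokens = [1, 2, 3, 4, 1, 2]      # medium locality
corpus_tokens = [9, 9, 5, 6, 1, 0, 9]    # lowest locality

dbs = [build_ngram_db(t) for t in (context_tokens, history_tokens, corpus_tokens)]
print(hierarchy_draft([0, 5, 6], dbs))   # prefix (5, 6) hits the context DB first
```

Note that the prefix `(5, 6)` also appears in the corpus database, but the context database is consulted first, so the higher-locality continuation wins; the drafted tokens would then be verified by the target LLM in a single forward pass.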