Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
February 8, 2025
Authors: Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon
cs.AI
Abstract
Accelerating inference in Large Language Models (LLMs) is critical for
real-time interactions, as they have been widely incorporated into real-world
services. Speculative decoding, a fully algorithmic solution, has gained
attention for improving inference speed by drafting and verifying tokens,
thereby generating multiple tokens in a single forward pass. However, current
drafting strategies usually require significant fine-tuning or have
inconsistent performance across tasks. To address these challenges, we propose
Hierarchy Drafting (HD), a novel lossless drafting approach that organizes
various token sources into multiple databases in a hierarchical framework based
on temporal locality. In the drafting step, HD sequentially accesses multiple
databases to obtain draft tokens from the highest to the lowest locality,
ensuring consistent acceleration across diverse tasks and minimizing drafting
latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters
demonstrate that HD outperforms existing database drafting methods, achieving
robust inference speedups across model sizes, tasks, and temperatures.
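The drafting step described above — querying hierarchically ordered token databases from highest to lowest temporal locality and returning the first match — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the three n-gram databases (current context, recent history, static corpus), the prefix length, and the draft length are all assumptions chosen for clarity.

```python
def build_ngram_db(token_ids, prefix_len=2, draft_len=3):
    """Map each `prefix_len`-token prefix to the tokens that followed it.

    Stands in for one HD database; a later occurrence of a prefix
    overwrites an earlier one, keeping the most recent continuation.
    """
    db = {}
    for i in range(len(token_ids) - prefix_len - draft_len + 1):
        prefix = tuple(token_ids[i:i + prefix_len])
        db[prefix] = token_ids[i + prefix_len:i + prefix_len + draft_len]
    return db


def hierarchy_draft(context, databases, prefix_len=2):
    """Query databases from highest to lowest locality; return the first hit."""
    prefix = tuple(context[-prefix_len:])
    for db in databases:  # ordered: current context -> history -> corpus
        draft = db.get(prefix)
        if draft is not None:
            return draft
    return []  # no draft found; fall back to plain autoregressive decoding


# Toy example with integer token ids standing in for subword tokens.
context_tokens = [5, 6, 7, 8, 5, 6]      # highest locality
history_tokens = [1, 2, 3, 4, 1, 2]      # medium locality
corpus_tokens = [9, 9, 5, 6, 1, 0, 9]    # lowest locality

dbs = [build_ngram_db(t) for t in (context_tokens, history_tokens, corpus_tokens)]
print(hierarchy_draft([0, 5, 6], dbs))   # prefix (5, 6) hits the context DB first
```

Note that the prefix `(5, 6)` also appears in the corpus database, but the context database is consulted first, so the higher-locality continuation wins; the drafted tokens would then be verified by the target LLM in a single forward pass.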