Cascade Speculative Drafting for Even Faster LLM Inference
December 18, 2023
Authors: Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang
cs.AI
Abstract
Speculative decoding enhances the efficiency of large language models (LLMs)
by using a smaller draft model to generate candidate tokens for a larger
target model to verify. However, drafting in speculative decoding still
involves slow autoregressive generation, and it spends the same amount of
time on every token regardless of its importance. These two inefficiencies
lead to suboptimal performance. To address this issue, we introduce Cascade
Speculative Drafting (CS Drafting), a novel approach that employs two types
of cascades. The Vertical Cascade eliminates autoregressive generation from
neural models, while the Horizontal Cascade allocates drafting time
efficiently across token positions, with its optimality supported by our
theoretical analysis. Combining both cascades, our CS Drafting algorithm
achieves up to 72% additional speedup over speculative decoding in our
experiments while preserving the same output distribution.
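
For context, the sketch below illustrates plain speculative decoding, the draft-then-verify mechanism whose drafting stage CS Drafting restructures. It is a minimal toy sketch, not the authors' implementation: `draft_model` and `target_model` are hypothetical stand-in distributions over a tiny vocabulary, where a real system would use neural LMs.

```python
# Toy sketch of speculative decoding (draft-then-verify), the base
# mechanism CS Drafting builds on. The "models" here are hypothetical
# stand-ins: functions mapping a token prefix to a distribution over
# a small vocabulary of integer token ids.
import random

VOCAB_SIZE = 8  # toy vocabulary of 8 token ids

def toy_dist(prefix, temperature):
    """Deterministic pseudo-random distribution conditioned on the prefix
    (a stand-in for a real language model's next-token distribution)."""
    rng = random.Random(str((prefix, temperature)))
    weights = [rng.random() ** temperature for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]

def draft_model(prefix):   # hypothetical cheap, lower-quality drafter
    return toy_dist(prefix, temperature=2.0)

def target_model(prefix):  # hypothetical expensive target model
    return toy_dist(prefix, temperature=1.0)

def sample(dist, rng):
    """Sample a token id from a probability distribution."""
    r, acc = rng.random(), 0.0
    for tok, p in enumerate(dist):
        acc += p
        if r <= acc:
            return tok
    return len(dist) - 1

def speculative_step(prefix, k, rng):
    """Draft k tokens autoregressively, then verify them with the target.

    Accepting each drafted token with probability min(1, p_target / p_draft)
    and resampling from the residual on rejection preserves the target
    model's output distribution exactly.
    """
    # Phase 1: cheap autoregressive drafting (the stage CS Drafting attacks).
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = sample(draft_model(ctx), rng)
        drafted.append(tok)
        ctx.append(tok)

    # Phase 2: verification against the target model.
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        p_t = target_model(ctx)[tok]
        p_d = draft_model(ctx)[tok]
        if rng.random() < min(1.0, p_t / p_d):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On rejection, resample from the normalized residual
            # max(0, p_target - p_draft) and end the round.
            residual = [max(0.0, t - d)
                        for t, d in zip(target_model(ctx), draft_model(ctx))]
            z = sum(residual)
            accepted.append(sample([r / z for r in residual], rng))
            break
    # (A full implementation also samples one bonus token from the target
    # when all k drafts are accepted; omitted here for brevity.)
    return accepted

rng = random.Random(0)
tokens = [0]
for _ in range(4):
    tokens += speculative_step(tokens, k=4, rng=rng)
print(tokens)
```

CS Drafting's two cascades then act on phase 1 above: the Vertical Cascade removes the neural drafter's own autoregressive loop, and the Horizontal Cascade reallocates drafting time across the k positions, as described in the abstract.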