Cascade Speculative Drafting for Even Faster LLM Inference
December 18, 2023
Authors: Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang
cs.AI
Abstract
Speculative decoding enhances the efficiency of large language models (LLMs)
by using a smaller draft model to propose tokens for a larger target model to
review. However, drafting in speculative decoding still involves slow
autoregressive generation and allocates the same amount of time to tokens
regardless of their importance. These two inefficiencies lead to suboptimal
performance. To address this issue, we introduce Cascade Speculative Drafting
(CS Drafting), a novel approach that employs two types of cascades. The
Vertical Cascade eliminates autoregressive generation from neural models. The
Horizontal Cascade allocates drafting time efficiently, with its optimality
supported by our theoretical analysis. Combining both cascades, our CS
Drafting algorithm achieves up to a 72 percent additional speedup over
speculative decoding in our experiments while preserving the same output
distribution.
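As a rough illustration of the draft-and-verify loop underlying speculative decoding, and of the horizontal-cascade idea of assigning stronger drafters to earlier, more important draft positions, here is a minimal toy sketch. The deterministic token rules and the model functions (`target`, `big_draft`, `small_draft`) are invented for this example; this is not the paper's implementation, only a greedy-decoding analogue of the technique.

```python
# Toy greedy speculative decoding with a per-position cascade of drafters.
# All "models" here are made-up deterministic functions over token lists.

def target(seq):
    # The "large" target model: toy next-token rule.
    return (3 * seq[-1] + len(seq)) % 11

def big_draft(seq):
    # A closer (but imperfect) approximation of the target.
    if seq[-1] % 2 == 0:
        return (3 * seq[-1] + len(seq)) % 11
    return (seq[-1] + 1) % 11

def small_draft(seq):
    # A cheaper, less accurate drafter.
    return (seq[-1] + 1) % 11

def speculative_step(seq, drafters):
    """Draft len(drafters) tokens, with position j produced by drafters[j]
    (stronger drafters first -- the horizontal-cascade idea), then let the
    target verify greedily. Every returned token equals the target's own
    greedy choice, so the final output matches pure target decoding."""
    draft = []
    for d in drafters:
        draft.append(d(seq + draft))
    accepted = []
    for tok in draft:
        t = target(seq + accepted)
        if tok == t:
            accepted.append(tok)       # draft token verified, keep it
        else:
            accepted.append(t)         # target's correction; stop here
            return accepted
    # All draft tokens accepted: the target contributes one bonus token.
    accepted.append(target(seq + accepted))
    return accepted

def generate(seq, n, drafters):
    out = list(seq)
    while len(out) < len(seq) + n:
        out += speculative_step(out, drafters)
    return out[:len(seq) + n]
```

The key property, which also holds for real (sampling-based) speculative decoding, is that acceleration never changes the output: `generate([2], 20, [big_draft, big_draft, small_draft])` produces exactly the sequence the target model would produce on its own, just in fewer target invocations when the drafters guess well.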