Recurrent Drafter for Fast Speculative Decoding in Large Language Models

Abstract

In this paper, we introduce an improved approach to speculative decoding aimed at enhancing the efficiency of serving large language models. Our method capitalizes on the strengths of two established techniques: the classic two-model speculative decoding approach and the more recent single-model approach, Medusa. Drawing inspiration from Medusa, our approach adopts a single-model strategy for speculative decoding. However, our method distinguishes itself by employing a single, lightweight draft head with a recurrent dependency design, akin in essence to the small draft model used in classic speculative decoding, but without the complexities of the full transformer architecture. Because of the recurrent dependency, we can use beam search to swiftly filter out undesired candidates with the draft head. The outcome is a method that retains the simplicity of a single-model design while avoiding the need to create a data-dependent tree attention structure solely for inference, as Medusa does. We empirically demonstrate the effectiveness of the proposed method on several popular open-source language models, along with a comprehensive analysis of the trade-offs involved in adopting this approach.
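
To make the idea of a recurrent draft head combined with beam search more concrete, the following is a minimal toy sketch in plain Python. It is not the paper's implementation: the class `ToyRecurrentDraftHead`, the function `beam_search_draft`, the random weights, and the tanh update rule are all hypothetical choices made for illustration. The only properties it tries to show are that each drafting step depends on the previous state and the previously drafted token, and that beam search keeps only the highest-scoring draft candidates, which the large model would then verify.

```python
import math
import random


def make_matrix(rows, cols, rng):
    # Random weights stand in for trained parameters (assumption for illustration).
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def log_softmax(logits):
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]


class ToyRecurrentDraftHead:
    """Hypothetical stand-in for a lightweight recurrent draft head."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.embed = make_matrix(vocab_size, hidden_size, rng)
        self.w_state = make_matrix(hidden_size, hidden_size, rng)
        self.w_out = make_matrix(vocab_size, hidden_size, rng)

    def step(self, state, token):
        # Recurrent dependency: the new state is a function of the previous
        # state and the token drafted at the previous step.
        mixed = matvec(self.w_state, state)
        new_state = [math.tanh(a + b) for a, b in zip(mixed, self.embed[token])]
        return new_state, log_softmax(matvec(self.w_out, new_state))


def beam_search_draft(head, llm_state, last_token, num_steps, beam_width):
    """Draft num_steps tokens, keeping the beam_width best candidates per step."""
    # Each beam entry: (cumulative log-prob, drafted tokens, state, previous token).
    beams = [(0.0, [], llm_state, last_token)]
    for _ in range(num_steps):
        expanded = []
        for score, tokens, state, prev in beams:
            new_state, log_probs = head.step(state, prev)
            for tok, lp in enumerate(log_probs):
                expanded.append((score + lp, tokens + [tok], new_state, tok))
        # Prune low-scoring continuations; only the top candidates survive.
        expanded.sort(key=lambda beam: beam[0], reverse=True)
        beams = expanded[:beam_width]
    return [(tokens, score) for score, tokens, _, _ in beams]


head = ToyRecurrentDraftHead(vocab_size=8, hidden_size=4)
candidates = beam_search_draft(
    head, llm_state=[0.1, -0.2, 0.3, 0.0], last_token=3, num_steps=3, beam_width=2
)
for tokens, score in candidates:
    print(tokens, round(score, 3))
```

In this sketch, `llm_state` plays the role of a hidden state taken from the large model, and the returned candidates are ordered from most to least likely under the toy head. Verification of the drafted tokens by the large model, which speculative decoding requires, is deliberately left out.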
