Recurrent Drafter for Fast Speculative Decoding in Large Language Models

Abstract

In this paper, we introduce an improved approach to speculative decoding aimed at enhancing the efficiency of serving large language models. Our method capitalizes on the strengths of two established techniques: the classic two-model speculative decoding approach, and the more recent single-model approach, Medusa. Drawing inspiration from Medusa, our approach adopts a single-model strategy for speculative decoding. However, our method distinguishes itself by employing a single, lightweight draft head with a recurrent dependency design, akin in essence to the small draft model used in classic speculative decoding, but without the complexities of the full transformer architecture. Because of the recurrent dependency, we can use beam search to swiftly filter out undesired candidates with the draft head. The outcome is a method that retains the simplicity of the single-model design while avoiding the need, as in Medusa, to create a data-dependent tree attention structure solely for inference. We empirically demonstrate the effectiveness of the proposed method on several popular open-source language models, along with a comprehensive analysis of the trade-offs involved in adopting this approach.
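To make the idea of a recurrent draft head combined with beam search more concrete, the following is a minimal toy sketch in plain Python. It is not the authors' implementation: the single tanh recurrence, the random weights, the choice to seed the recurrent state with a hidden vector standing in for the language model's output, and all names (make_toy_head, step, beam_search_draft) are assumptions made purely for illustration. The verification of drafted tokens by the full language model is omitted.

```python
import math
import random


def make_toy_head(vocab_size, hidden_size, seed=0):
    """Build random weights for a hypothetical one-layer recurrent draft head."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": mat(vocab_size, hidden_size),   # token embeddings
        "w_rec": mat(hidden_size, hidden_size),  # recurrent weights
        "w_out": mat(vocab_size, hidden_size),   # output projection
    }


def step(head, state, token):
    """Update the recurrent state with the previous token; return (new_state, log_probs)."""
    emb = head["embed"][token]
    new_state = [
        math.tanh(sum(w * s for w, s in zip(row, state)) + e)
        for row, e in zip(head["w_rec"], emb)
    ]
    logits = [sum(w * s for w, s in zip(row, new_state)) for row in head["w_out"]]
    # Numerically stable log-softmax over the vocabulary.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return new_state, [l - log_z for l in logits]


def beam_search_draft(head, llm_state, last_token, draft_len, beam_width):
    """Return up to beam_width draft continuations, best first, as (tokens, log_prob)."""
    # Assumption: the recurrent state is seeded with a hidden vector from the language model.
    beams = [([], 0.0, llm_state, last_token)]
    for _ in range(draft_len):
        candidates = []
        for tokens, score, state, prev in beams:
            new_state, log_probs = step(head, state, prev)
            for tok, lp in enumerate(log_probs):
                candidates.append((tokens + [tok], score + lp, new_state, tok))
        # Keep only the highest-scoring partial drafts; the rest are filtered out early.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [(tokens, score) for tokens, score, _, _ in beams]


head = make_toy_head(vocab_size=8, hidden_size=4)
drafts = beam_search_draft(
    head, llm_state=[0.1, -0.2, 0.3, 0.0], last_token=3, draft_len=3, beam_width=2
)
for tokens, score in drafts:
    print(tokens, round(score, 3))
```

Because each draft step depends on the previously drafted token through the recurrent state, partial sequences can be scored and pruned as they are extended, which is what allows beam search to discard weak candidates before they would be passed to the language model for verification.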
