Ouroboros: Speculative Decoding with Large Model Enhanced Drafting

Abstract

Drafting-then-verifying decoding methods such as speculative decoding are widely adopted, training-free approaches to accelerating the inference of large language models (LLMs). Instead of decoding tokens sequentially in an autoregressive process, speculative decoding first creates a draft with an efficient small model. The LLM then verifies and corrects the draft in a non-autoregressive fashion to minimize time overhead. A longer draft yields a larger speedup once it is verified, but incurs a substantial trial-and-error cost if verification fails. Because the probability of verification failure is high, existing decoding methods cannot draft much content for verification at one time, and so achieve sub-optimal inference acceleration. In this paper, we introduce Ouroboros, which constructs a phrase candidate pool from the verification process of the LLM to provide candidates for the small model's draft generation. Ouroboros can thereby further improve the efficiency and effectiveness of the initial drafts. Experimental results on typical text generation tasks show that Ouroboros achieves speedups of up to 1.9x over lookahead decoding and up to 2.8x over speculative decoding. The source code of Ouroboros is available at https://github.com/thunlp/Ouroboros.
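
The generic draft-then-verify loop described in the abstract can be illustrated with a minimal Python sketch. It is not the Ouroboros implementation and does not model the phrase candidate pool. It only shows greedy speculative decoding, under the assumption that both models are deterministic functions from a token prefix to the next token. The names `target_model`, `draft_model`, `speculative_decode`, and `draft_len` are hypothetical and chosen for illustration.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]

VOCAB_SIZE = 50


def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large model (assumption, not a real LLM).
    return (sum(prefix) * 31 + len(prefix)) % VOCAB_SIZE


def draft_model(prefix: List[int]) -> int:
    # Hypothetical small model: agrees with the target except when the
    # prefix length is a multiple of 4, where it is deliberately wrong.
    guess = target_model(prefix)
    return (guess + 1) % VOCAB_SIZE if len(prefix) % 4 == 0 else guess


def autoregressive_decode(model: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    # Baseline: one model call per generated token.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], max_new_tokens: int, draft_len: int = 4
) -> Tuple[List[int], int]:
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    verification_passes = 0
    while len(tokens) < limit:
        # 1. Draft: the small model proposes draft_len tokens autoregressively.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))
        # 2. Verify: a real LLM would score every draft position in a single
        #    forward pass; this sketch simulates that pass with a loop.
        verification_passes += 1
        accepted: List[int] = []
        for i in range(draft_len + 1):
            expected = target(tokens + accepted)
            accepted.append(expected)  # matching draft token, correction, or bonus token
            if i == draft_len or proposal[i] != expected:
                break  # stop at the first mismatch, or after the bonus token
        tokens.extend(accepted)
    return tokens[:limit], verification_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = autoregressive_decode(target_model, prompt, 20)
    output, passes = speculative_decode(target_model, draft_model, prompt, 20)
    print(output == baseline, passes)
```

Because every accepted token is one the target model would have produced itself, the output matches plain greedy decoding with the target model, while the number of verification passes is smaller than the number of generated tokens whenever the draft is partly correct. A longer `draft_len` helps only while the draft keeps matching, which is the trade-off the abstract describes.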
