Distributed Speculative Inference of Large Language Models

Abstract

Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than both speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requires no training or architectural modifications, and preserves the target distribution. Prior studies of SI have demonstrated empirical speedups over non-SI, but they require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often lack a matching drafter that is sufficiently fast and accurate. We show a gap: SI becomes slower than non-SI when the drafter is too slow or too inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI for any drafter. By orchestrating multiple instances of the target and the drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups for off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.
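The gap between SI and non-SI described above can be illustrated with a back-of-the-envelope calculation. The sketch below is not the analysis used in this paper; it assumes a simplified cost model common in the SI literature, in which each drafted token is accepted independently with a fixed probability, one drafter forward pass costs a fixed fraction of a target forward pass, and verifying all drafted tokens costs a single target forward pass. The function name and its parameters (`acceptance_rate`, `drafter_cost`, `lookahead`) are hypothetical and chosen only for illustration.

```python
def si_expected_speedup(acceptance_rate: float, drafter_cost: float, lookahead: int) -> float:
    """Expected speedup of SI over non-SI under a simplified cost model.

    Assumptions (illustrative, not taken from the paper):
      - each drafted token is accepted independently with probability
        `acceptance_rate`;
      - one drafter forward pass costs `drafter_cost` target forward passes;
      - verifying `lookahead` drafted tokens costs one target forward pass.
    A return value below 1.0 means SI is slower than non-SI.
    """
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    if drafter_cost < 0.0 or lookahead < 1:
        raise ValueError("drafter_cost must be >= 0 and lookahead >= 1")

    # Expected number of tokens produced per SI iteration: the accepted
    # prefix of the draft plus one token from the target itself.
    expected_tokens = (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)

    # Cost of one SI iteration, in units of one target forward pass.
    iteration_cost = lookahead * drafter_cost + 1.0

    # Non-SI produces exactly one token per target forward pass.
    return expected_tokens / iteration_cost


if __name__ == "__main__":
    scenarios = {
        "fast, accurate drafter": (0.8, 0.05, 5),
        "slow drafter": (0.8, 0.9, 5),
        "inaccurate drafter": (0.1, 0.3, 5),
    }
    for name, (a, c, k) in scenarios.items():
        print(f"{name}: speedup = {si_expected_speedup(a, c, k):.2f}")
```

Under these assumptions, only the first scenario yields a speedup above 1; the slow and the inaccurate drafter both make SI slower than plain autoregressive decoding, which is the regime DSI is designed to address.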
