Distributed Speculative Inference of Large Language Models
May 23, 2024
Authors: Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel
cs.AI
Abstract
Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution.
Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI becomes slower than non-SI when the drafter is too slow or too inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI for any choice of drafter. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI.
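For intuition, the sketch below shows the single-drafter speculative inference loop that DSI builds on; it is a simplified greedy-verification variant, not the paper's DSI algorithm (which orchestrates multiple target and drafter instances) and not the rejection-sampling acceptance rule that the cited SI works use to preserve the target's sampling distribution. The model names are illustrative placeholders. Each iteration costs one drafter generation of k tokens plus one target forward pass, which is why a drafter that is too slow or too often rejected can make SI slower than plain autoregressive decoding.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder models: any (drafter, target) pair sharing a tokenizer works.
DRAFTER = "gpt2"
TARGET = "gpt2-large"

tokenizer = AutoTokenizer.from_pretrained(TARGET)
drafter = AutoModelForCausalLM.from_pretrained(DRAFTER)
target = AutoModelForCausalLM.from_pretrained(TARGET)


@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64, k: int = 4) -> str:
    """Greedy speculative decoding sketch: the drafter proposes k tokens,
    the target verifies them in a single forward pass, and generation
    continues from the first disagreement (plus the target's own token)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    produced = 0
    while produced < max_new_tokens:
        # 1) Drafter cheaply proposes up to k candidate tokens (greedy).
        draft = drafter.generate(ids, max_new_tokens=k, do_sample=False)
        proposal = draft[0, ids.shape[1]:]
        # 2) Target scores context + proposal in one forward pass.
        logits = target(draft).logits[0]
        accepted = []
        for i, tok in enumerate(proposal):
            # Target's greedy choice at the position preceding this token.
            expected = logits[ids.shape[1] + i - 1].argmax()
            if expected.item() == tok.item():
                accepted.append(tok.item())        # drafter matched the target
            else:
                accepted.append(expected.item())   # take the target's token and stop
                break
        else:
            # All proposals accepted: the same target pass yields one bonus token.
            accepted.append(logits[-1].argmax().item())
        ids = torch.cat([ids, torch.tensor([accepted])], dim=1)
        produced += len(accepted)
    return tokenizer.decode(ids[0], skip_special_tokens=True)


print(speculative_generate("Speculative inference accelerates"))
```

In this toy setting, the more of the k proposed tokens the target accepts per iteration, the fewer expensive target passes are needed per generated token; with a weak or slow drafter the loop degenerates to roughly one target pass per token plus drafting overhead, i.e., slower than non-SI.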
Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.