Distributed Speculative Inference of Large Language Models
May 23, 2024
Authors: Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel
cs.AI
Abstract
Accelerating the inference of large language models (LLMs) is an important
challenge in artificial intelligence. This paper introduces distributed
speculative inference (DSI), a novel distributed inference algorithm that is
provably faster than speculative inference (SI) [leviathan2023fast,
chen2023accelerating, miao2023specinfer] and traditional autoregressive
inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs,
requiring no training or architectural modifications, and it preserves the
target distribution.
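
To make the SI baseline concrete, here is a minimal sketch of standard draft-then-verify speculative inference with the rejection-sampling acceptance rule of Leviathan et al. (2023), which is what preserves the target distribution. The `draft_probs` and `target_probs` functions are toy stand-ins for the drafter and target LLMs, not models from the paper, and a real implementation would batch the per-token target calls into a single forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy vocabulary size

def draft_probs(prefix):
    # Toy stand-in for the small drafter LLM's next-token distribution.
    logits = np.sin(np.arange(VOCAB) + len(prefix))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def target_probs(prefix):
    # Toy stand-in for the large target LLM's next-token distribution.
    logits = np.cos(0.7 * np.arange(VOCAB) + len(prefix))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speculative_step(prefix, k=4):
    """One SI iteration: draft k tokens cheaply, then verify with the target.

    The accept/resample rule below yields exact samples from the target
    distribution, regardless of drafter quality."""
    ctx, q = list(prefix), []
    for _ in range(k):
        p = draft_probs(ctx)
        q.append(p)
        ctx.append(int(rng.choice(VOCAB, p=p)))
    drafts = ctx[len(prefix):]
    accepted = []
    for i, t in enumerate(drafts):
        p_tgt = target_probs(prefix + accepted)  # batched in real systems
        if rng.random() < min(1.0, p_tgt[t] / q[i][t]):
            accepted.append(t)  # draft token accepted
        else:
            # Rejected: resample from the residual max(p_target - p_draft, 0).
            residual = np.maximum(p_tgt - q[i], 0.0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return accepted  # stop at the first rejection
    # All k drafts accepted: take one extra token from the target for free.
    accepted.append(int(rng.choice(VOCAB, p=target_probs(prefix + accepted))))
    return accepted

print(speculative_step([1, 2, 3]))
```

Each iteration costs one target pass but can emit up to k+1 tokens; the catch, as the next paragraph notes, is that the expected gain collapses when the drafter is slow or poorly aligned with the target.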
Prior studies on SI have demonstrated empirical speedups (compared to non-SI)
but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs
often do not have matching drafters that are sufficiently fast and accurate. We
show a gap: SI gets slower than non-SI when using slower or less accurate
drafters. We close this gap by proving that DSI is faster than both SI and
non-SI given any drafters. By orchestrating multiple instances of the target
and drafters, DSI is not only faster than SI but also supports LLMs that cannot
be accelerated with SI.
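
The abstract does not spell out the orchestration, so the sketch below is only a schematic reading of it: the drafter keeps speculating without blocking on verification, while a pool of target instances verifies drafted positions in parallel. All names (`draft_next`, `target_verify`, `dsi_generate`), the latencies, and the toy acceptance rule are illustrative assumptions, not the paper's algorithm.

```python
import concurrent.futures as cf
import time

# Illustrative latencies (seconds); real values depend on models and hardware.
DRAFT_LATENCY = 0.01   # fast drafter step
TARGET_LATENCY = 0.05  # slow target verification step

def draft_next(prefix):
    """Stand-in drafter: returns a deterministic toy next token."""
    time.sleep(DRAFT_LATENCY)
    return (sum(prefix) + len(prefix)) % 16

def target_verify(prefix, token):
    """Stand-in for one target instance.

    Returns (accepted, token_to_keep): either the drafted token is kept,
    or the target supplies its own replacement, as SI's rejection
    sampling does."""
    time.sleep(TARGET_LATENCY)
    if token % 3 != 0:                  # arbitrary toy acceptance rule
        return True, token
    return False, (token + 1) % 16      # toy "target-sampled" replacement

def dsi_generate(prefix, n_new, n_targets=4):
    """Schematic DSI loop: drafting never blocks on verification.

    Each drafted token is handed to a pool of target instances to verify
    in parallel while the drafter speculates ahead. On a rejection, the
    orchestrator rolls back to that position, keeps the target's
    replacement, and discards later speculation."""
    out = list(prefix)
    with cf.ThreadPoolExecutor(max_workers=n_targets) as pool:
        pending = []  # (position, future) verification tasks in flight
        while len(out) < len(prefix) + n_new or pending:
            if len(out) < len(prefix) + n_new:
                tok = draft_next(out)
                pending.append((len(out), pool.submit(target_verify, list(out), tok)))
                out.append(tok)
            else:
                # Nothing left to draft: wait for an in-flight verification.
                cf.wait([f for _, f in pending], return_when=cf.FIRST_COMPLETED)
            for pos, fut in sorted(pending):
                if not fut.done():
                    break  # later results depend on this one; check again later
                pending.remove((pos, fut))
                accepted, keep = fut.result()
                if not accepted:
                    out = out[:pos] + [keep]  # roll back, keep target's token
                    pending = [(p, f) for p, f in pending if p < pos]
                    break
    return out

print(dsi_generate([1, 2, 3], n_new=8))
```

The scheduling structure, not the numbers, is the point of the sketch: verification latency is hidden behind continued drafting, and on a rejection the orchestrator keeps the target's replacement token, so under this reading a weak drafter stalls only its own speculation rather than the whole pipeline, consistent with the claim that DSI stays faster than both SI and non-SI for any drafters.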
Our simulations show speedups of off-the-shelf LLMs in realistic settings:
DSI is 1.29-1.92x faster than SI.