Distributed Speculative Inference of Large Language Models
May 23, 2024
Authors: Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel
cs.AI
Abstract
Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution.
Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI becomes slower than non-SI when the drafter is too slow or too inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI for any choice of drafter. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI.
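For intuition, the sketch below shows the single-drafter speculative inference loop that DSI builds on; it is a simplified greedy-verification variant, not the paper's DSI algorithm (which orchestrates multiple target and drafter instances) and not the rejection-sampling acceptance rule that the cited SI works use to preserve the target's sampling distribution. The model names are illustrative placeholders. Each iteration costs one drafter generation of k tokens plus one target forward pass, which is why a drafter that is too slow or too often rejected can make SI slower than plain autoregressive decoding.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder models: any (drafter, target) pair sharing a tokenizer works.
DRAFTER = "gpt2"
TARGET = "gpt2-large"

tokenizer = AutoTokenizer.from_pretrained(TARGET)
drafter = AutoModelForCausalLM.from_pretrained(DRAFTER)
target = AutoModelForCausalLM.from_pretrained(TARGET)


@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64, k: int = 4) -> str:
    """Greedy speculative decoding sketch: the drafter proposes k tokens,
    the target verifies them in a single forward pass, and generation
    continues from the first disagreement (plus the target's own token)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    produced = 0
    while produced < max_new_tokens:
        # 1) Drafter cheaply proposes up to k candidate tokens (greedy).
        draft = drafter.generate(ids, max_new_tokens=k, do_sample=False)
        proposal = draft[0, ids.shape[1]:]
        # 2) Target scores context + proposal in one forward pass.
        logits = target(draft).logits[0]
        accepted = []
        for i, tok in enumerate(proposal):
            # Target's greedy choice at the position preceding this token.
            expected = logits[ids.shape[1] + i - 1].argmax()
            if expected.item() == tok.item():
                accepted.append(tok.item())        # drafter matched the target
            else:
                accepted.append(expected.item())   # take the target's token and stop
                break
        else:
            # All proposals accepted: the same target pass yields one bonus token.
            accepted.append(logits[-1].argmax().item())
        ids = torch.cat([ids, torch.tensor([accepted])], dim=1)
        produced += len(accepted)
    return tokenizer.decode(ids[0], skip_special_tokens=True)


print(speculative_generate("Speculative inference accelerates"))
```

In this toy setting, the more of the k proposed tokens the target accepts per iteration, the fewer expensive target passes are needed per generated token; with a weak or slow drafter the loop degenerates to roughly one target pass per token plus drafting overhead, i.e., slower than non-SI.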
Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.