Distributed Speculative Inference of Large Language Models

Abstract

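Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requires no training or architectural modifications, and preserves the target distribution. Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI gets slower than non-SI when using slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI given any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.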
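
The gap between SI and non-SI can be illustrated with a simplified latency model. The sketch below is not the paper's simulation and says nothing about DSI itself. It assumes the commonly used idealization of SI in which each drafted token is accepted independently with probability `alpha`, the drafter proposes `k` tokens per iteration, and one drafter forward pass costs a fraction `c` of one target forward pass. Under those assumptions, the expected speedup of SI over non-SI is (1 - alpha^(k+1)) / ((1 - alpha)(k*c + 1)). The function name `si_speedup` and its parameters are illustrative choices, not identifiers from the paper.

```python
def si_speedup(alpha: float, c: float, k: int) -> float:
    """Expected speedup of SI over non-SI under a simplified model.

    Assumptions (illustrative, not taken from the paper):
      alpha: probability that a drafted token is accepted, i.i.d. per token
      c:     drafter latency divided by target latency (per forward pass)
      k:     number of tokens drafted per SI iteration
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if c < 0 or k < 1:
        raise ValueError("c must be non-negative and k must be at least 1")

    # Expected number of tokens produced per SI iteration:
    # the accepted draft prefix plus one token from the target.
    if alpha == 1.0:
        expected_tokens = float(k + 1)
    else:
        expected_tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

    # Cost of one SI iteration in units of one target forward pass:
    # k drafter passes followed by one target verification pass.
    iteration_cost = k * c + 1.0

    # Non-SI produces exactly one token per target forward pass.
    return expected_tokens / iteration_cost


if __name__ == "__main__":
    # A fast, accurate drafter versus a slow, inaccurate one.
    print("fast, accurate drafter:", si_speedup(alpha=0.8, c=0.05, k=5))
    print("slow, inaccurate drafter:", si_speedup(alpha=0.3, c=0.5, k=5))
```

In this model, the first configuration yields a speedup greater than 1, while the second yields a value below 1, meaning SI is slower than non-SI. That is the regime the abstract refers to when it says SI gets slower with slower or less accurate drafters.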

