Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs
May 3, 2023
Authors: Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani, Tony Lee, Percy Liang
cs.AI
Abstract
Large language models (LLMs) power many state-of-the-art systems in natural
language processing. However, these models are extremely computationally
expensive, even at inference time, raising the natural question: when is the
extra cost of deploying a larger model worth the anticipated boost in
capabilities? Better understanding this tradeoff fundamentally could benefit
from an inference efficiency metric that is both (i) easily comparable across
models from different providers, and (ii) representative of the true cost of
running queries in an isolated performance environment. Unfortunately, access
to LLMs today is largely restricted to black-box text generation APIs, and raw
runtimes measured through this interface do not satisfy these desiderata: model
providers can apply various software and hardware optimizations orthogonal to
the model, and models served on shared infrastructure are susceptible to
performance contention. To circumvent these problems, we propose a new metric
for comparing inference efficiency across models. This metric puts models on
equal footing as though they were served (i) on uniform hardware and software,
and (ii) without performance contention. We call this metric the
idealized runtime, and we propose a methodology to efficiently estimate
this metric for autoregressive Transformer models. We also propose cost-aware
variants that incorporate the number of accelerators needed to serve the model.
Using these metrics, we compare ten state-of-the-art LLMs to provide the first
analysis of inference efficiency-capability tradeoffs; we make several
observations from this analysis, including the fact that the superior inference
runtime performance of certain APIs is often a byproduct of optimizations
within the API rather than the underlying model. Our methodology also
facilitates the efficient comparison of different software and hardware stacks.
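To make the idea of an idealized runtime and its cost-aware variants concrete, the following is a minimal sketch, not the authors' implementation. It assumes, purely for illustration, that a query's runtime on a uniform reference hardware/software stack can be approximated as linear in the prompt and output token counts, and that a cost-aware variant scales that estimate by the number of accelerators needed to serve the model. All class names, coefficients, and numbers below are hypothetical.

```python
# Illustrative sketch only: approximates an "idealized runtime"-style estimate
# under the assumption that autoregressive inference cost is roughly linear in
# prompt and output token counts on a fixed reference stack.
from dataclasses import dataclass


@dataclass
class RuntimeModel:
    # Hypothetical coefficients, imagined as fit by profiling the model on
    # uniform hardware and software without performance contention.
    seconds_per_prompt_token: float
    seconds_per_output_token: float
    fixed_overhead_seconds: float

    def idealized_runtime(self, num_prompt_tokens: int, num_output_tokens: int) -> float:
        """Estimated per-query runtime (seconds) on the reference stack."""
        return (
            self.fixed_overhead_seconds
            + self.seconds_per_prompt_token * num_prompt_tokens
            + self.seconds_per_output_token * num_output_tokens
        )

    def cost_aware_runtime(
        self, num_prompt_tokens: int, num_output_tokens: int, num_accelerators: int
    ) -> float:
        """Cost-aware variant: scale by the accelerators needed to serve the model."""
        return self.idealized_runtime(num_prompt_tokens, num_output_tokens) * num_accelerators


# Example: comparing two hypothetical models on the same query (illustrative numbers).
small = RuntimeModel(1e-5, 2e-2, 0.05)   # assume served on 1 accelerator
large = RuntimeModel(3e-5, 6e-2, 0.10)   # assume served on 8 accelerators
query = dict(num_prompt_tokens=512, num_output_tokens=128)
print(small.cost_aware_runtime(**query, num_accelerators=1))
print(large.cost_aware_runtime(**query, num_accelerators=8))
```

Under this kind of sketch, a capability gain from the larger model would have to outweigh the gap between the two cost-aware estimates; the abstract's tradeoff analysis compares real LLMs along exactly this kind of efficiency axis, though with its own estimation methodology.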