Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs
May 3, 2023
Authors: Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani, Tony Lee, Percy Liang
cs.AI
Abstract
Large language models (LLMs) power many state-of-the-art systems in natural
language processing. However, these models are extremely computationally
expensive, even at inference time, raising the natural question: when is the
extra cost of deploying a larger model worth the anticipated boost in
capabilities? Better understanding this tradeoff fundamentally could benefit
from an inference efficiency metric that is both (i) easily comparable across
models from different providers, and (ii) representative of the true cost of
running queries in an isolated performance environment. Unfortunately, access
to LLMs today is largely restricted to black-box text generation APIs, and raw
runtimes measured through this interface do not satisfy these desiderata: model
providers can apply various software and hardware optimizations orthogonal to
the model, and models served on shared infrastructure are susceptible to
performance contention. To circumvent these problems, we propose a new metric
for comparing inference efficiency across models. This metric puts models on
equal footing as though they were served (i) on uniform hardware and software,
and (ii) without performance contention. We call this metric the
idealized runtime, and we propose a methodology to efficiently estimate
this metric for autoregressive Transformer models. We also propose cost-aware
variants that incorporate the number of accelerators needed to serve the model.
Using these metrics, we compare ten state-of-the-art LLMs to provide the first
analysis of inference efficiency-capability tradeoffs; we make several
observations from this analysis, including the fact that the superior inference
runtime performance of certain APIs is often a byproduct of optimizations
within the API rather than the underlying model. Our methodology also
facilitates the efficient comparison of different software and hardware stacks.
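To make the idealized runtime and its cost-aware variant concrete, here is a minimal sketch, not the paper's exact estimator: it assumes per-query runtime on a fixed reference hardware and software stack is roughly affine in the prompt and output token counts, fits that model to a handful of profiled measurements, and scales the result by the number of accelerators needed to serve the model. The profile points, coefficients, and helper names below are illustrative assumptions.

```python
# Illustrative sketch of an idealized-runtime estimator (assumed affine model,
# hypothetical numbers); not the paper's exact methodology.
import numpy as np

# Hypothetical profile on a reference stack: (prompt_tokens, output_tokens, seconds).
profile = np.array([
    (128,  16, 0.45),
    (128,  64, 1.30),
    (512,  16, 0.55),
    (512,  64, 1.45),
    (1024, 64, 1.60),
], dtype=float)

# Least-squares fit: runtime ~ alpha + beta_prompt * prompt + beta_output * output.
features = np.column_stack([np.ones(len(profile)), profile[:, 0], profile[:, 1]])
coeffs, *_ = np.linalg.lstsq(features, profile[:, 2], rcond=None)

def idealized_runtime(prompt_tokens: int, output_tokens: int) -> float:
    """Estimated per-query runtime (seconds) on the idealized, contention-free stack."""
    alpha, beta_prompt, beta_output = coeffs
    return alpha + beta_prompt * prompt_tokens + beta_output * output_tokens

def idealized_cost(prompt_tokens: int, output_tokens: int, num_accelerators: int) -> float:
    """Cost-aware variant: idealized runtime scaled by the accelerators serving the model."""
    return idealized_runtime(prompt_tokens, output_tokens) * num_accelerators

print(f"runtime: {idealized_runtime(256, 32):.3f} s")
print(f"cost:    {idealized_cost(256, 32, num_accelerators=8):.3f} accelerator-seconds")
```

Under a sketch like this, two APIs serving the same query are compared on the estimated runtime alone, so provider-side serving optimizations and performance contention drop out of the comparison.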