Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs
May 3, 2023
Authors: Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani, Tony Lee, Percy Liang
cs.AI
Abstract
Large language models (LLMs) power many state-of-the-art systems in natural
language processing. However, these models are extremely computationally
expensive, even at inference time, raising the natural question: when is the
extra cost of deploying a larger model worth the anticipated boost in
capabilities? Better understanding this tradeoff fundamentally could benefit
from an inference efficiency metric that is both (i) easily comparable across
models from different providers, and (ii) representative of the true cost of
running queries in an isolated performance environment. Unfortunately, access
to LLMs today is largely restricted to black-box text generation APIs, and raw
runtimes measured through this interface do not satisfy these desiderata: model
providers can apply various software and hardware optimizations orthogonal to
the model, and models served on shared infrastructure are susceptible to
performance contention. To circumvent these problems, we propose a new metric
for comparing inference efficiency across models. This metric puts models on
equal footing as though they were served (i) on uniform hardware and software,
and (ii) without performance contention. We call this metric the
idealized runtime, and we propose a methodology to efficiently estimate
this metric for autoregressive Transformer models. We also propose cost-aware
variants that incorporate the number of accelerators needed to serve the model.
Using these metrics, we compare ten state-of-the-art LLMs to provide the first
analysis of inference efficiency-capability tradeoffs; we make several
observations from this analysis, including the fact that the superior inference
runtime performance of certain APIs is often a byproduct of optimizations
within the API rather than the underlying model. Our methodology also
facilitates the efficient comparison of different software and hardware stacks.
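To make the idealized runtime and its cost-aware variant concrete, here is a minimal sketch, not the paper's exact estimator: it assumes per-query runtime on a fixed reference hardware and software stack is roughly affine in the prompt and output token counts, fits that model to a handful of profiled measurements, and scales the result by the number of accelerators needed to serve the model. The profile points, coefficients, and helper names below are illustrative assumptions.

```python
# Illustrative sketch of an idealized-runtime estimator (assumed affine model,
# hypothetical numbers); not the paper's exact methodology.
import numpy as np

# Hypothetical profile on a reference stack: (prompt_tokens, output_tokens, seconds).
profile = np.array([
    (128,  16, 0.45),
    (128,  64, 1.30),
    (512,  16, 0.55),
    (512,  64, 1.45),
    (1024, 64, 1.60),
], dtype=float)

# Least-squares fit: runtime ~ alpha + beta_prompt * prompt + beta_output * output.
features = np.column_stack([np.ones(len(profile)), profile[:, 0], profile[:, 1]])
coeffs, *_ = np.linalg.lstsq(features, profile[:, 2], rcond=None)

def idealized_runtime(prompt_tokens: int, output_tokens: int) -> float:
    """Estimated per-query runtime (seconds) on the idealized, contention-free stack."""
    alpha, beta_prompt, beta_output = coeffs
    return alpha + beta_prompt * prompt_tokens + beta_output * output_tokens

def idealized_cost(prompt_tokens: int, output_tokens: int, num_accelerators: int) -> float:
    """Cost-aware variant: idealized runtime scaled by the accelerators serving the model."""
    return idealized_runtime(prompt_tokens, output_tokens) * num_accelerators

print(f"runtime: {idealized_runtime(256, 32):.3f} s")
print(f"cost:    {idealized_cost(256, 32, num_accelerators=8):.3f} accelerator-seconds")
```

Under a sketch like this, two APIs serving the same query are compared on the estimated runtime alone, so provider-side serving optimizations and performance contention drop out of the comparison.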