Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs
May 3, 2023
Authors: Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani, Tony Lee, Percy Liang
cs.AI
Abstract
Large language models (LLMs) power many state-of-the-art systems in natural
language processing. However, these models are extremely computationally
expensive, even at inference time, raising the natural question: when is the
extra cost of deploying a larger model worth the anticipated boost in
capabilities? Better understanding this tradeoff fundamentally could benefit
from an inference efficiency metric that is both (i) easily comparable across
models from different providers, and (ii) representative of the true cost of
running queries in an isolated performance environment. Unfortunately, access
to LLMs today is largely restricted to black-box text generation APIs, and raw
runtimes measured through this interface do not satisfy these desiderata: model
providers can apply various software and hardware optimizations orthogonal to
the model, and models served on shared infrastructure are susceptible to
performance contention. To circumvent these problems, we propose a new metric
for comparing inference efficiency across models. This metric puts models on
equal footing as though they were served (i) on uniform hardware and software,
and (ii) without performance contention. We call this metric the
idealized runtime, and we propose a methodology to efficiently estimate
this metric for autoregressive Transformer models. We also propose cost-aware
variants that incorporate the number of accelerators needed to serve the model.
Using these metrics, we compare ten state-of-the-art LLMs to provide the first
analysis of inference efficiency-capability tradeoffs; we make several
observations from this analysis, including the fact that the superior inference
runtime performance of certain APIs is often a byproduct of optimizations
within the API rather than the underlying model. Our methodology also
facilitates the efficient comparison of different software and hardware stacks.
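To make the idea of an idealized runtime and its cost-aware variants concrete, the following is a minimal sketch, not the authors' implementation. It assumes, purely for illustration, that a query's runtime on a uniform reference hardware/software stack can be approximated as linear in the prompt and output token counts, and that a cost-aware variant scales that estimate by the number of accelerators needed to serve the model. All class names, coefficients, and numbers below are hypothetical.

```python
# Illustrative sketch only: approximates an "idealized runtime"-style estimate
# under the assumption that autoregressive inference cost is roughly linear in
# prompt and output token counts on a fixed reference stack.
from dataclasses import dataclass


@dataclass
class RuntimeModel:
    # Hypothetical coefficients, imagined as fit by profiling the model on
    # uniform hardware and software without performance contention.
    seconds_per_prompt_token: float
    seconds_per_output_token: float
    fixed_overhead_seconds: float

    def idealized_runtime(self, num_prompt_tokens: int, num_output_tokens: int) -> float:
        """Estimated per-query runtime (seconds) on the reference stack."""
        return (
            self.fixed_overhead_seconds
            + self.seconds_per_prompt_token * num_prompt_tokens
            + self.seconds_per_output_token * num_output_tokens
        )

    def cost_aware_runtime(
        self, num_prompt_tokens: int, num_output_tokens: int, num_accelerators: int
    ) -> float:
        """Cost-aware variant: scale by the accelerators needed to serve the model."""
        return self.idealized_runtime(num_prompt_tokens, num_output_tokens) * num_accelerators


# Example: comparing two hypothetical models on the same query (illustrative numbers).
small = RuntimeModel(1e-5, 2e-2, 0.05)   # assume served on 1 accelerator
large = RuntimeModel(3e-5, 6e-2, 0.10)   # assume served on 8 accelerators
query = dict(num_prompt_tokens=512, num_output_tokens=128)
print(small.cost_aware_runtime(**query, num_accelerators=1))
print(large.cost_aware_runtime(**query, num_accelerators=8))
```

Under this kind of sketch, a capability gain from the larger model would have to outweigh the gap between the two cost-aware estimates; the abstract's tradeoff analysis compares real LLMs along exactly this kind of efficiency axis, though with its own estimation methodology.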