Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs

Abstract

Large language models (LLMs) power many state-of-the-art systems in natural language processing. However, these models are extremely computationally expensive, even at inference time, which raises a natural question: when is the extra cost of deploying a larger model worth the anticipated boost in capabilities? Understanding this tradeoff at a fundamental level would benefit from an inference efficiency metric that is both (i) easily comparable across models from different providers and (ii) representative of the true cost of running queries in an isolated performance environment. Unfortunately, access to LLMs today is largely restricted to black-box text generation APIs, and raw runtimes measured through this interface do not satisfy these desiderata: model providers can apply various software and hardware optimizations orthogonal to the model, and models served on shared infrastructure are susceptible to performance contention. To circumvent these problems, we propose a new metric for comparing inference efficiency across models. This metric puts models on equal footing, as though they were served (i) on uniform hardware and software and (ii) without performance contention. We call this metric the idealized runtime, and we propose a methodology to efficiently estimate it for autoregressive Transformer models. We also propose cost-aware variants that incorporate the number of accelerators needed to serve the model. Using these metrics, we compare ten state-of-the-art LLMs to provide the first analysis of inference efficiency-capability tradeoffs. We make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than of the underlying model. Our methodology also facilitates the efficient comparison of different software and hardware stacks.
