Cost-of-Pass: An Economic Framework for Evaluating Language Models

Abstract

The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and cost. We propose a framework grounded in production theory for evaluating language models by combining accuracy and inference cost. We introduce "cost-of-pass", the expected monetary cost of generating a correct solution. We then define the "frontier cost-of-pass" as the minimum cost-of-pass achievable across available models or a human expert, where the human-expert cost is approximated by the cost of hiring an expert. Our analysis reveals several distinct economic insights. First, lightweight models are most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite higher per-token costs. Second, tracking the frontier cost-of-pass over the past year reveals significant progress, particularly for complex quantitative tasks, where the cost has roughly halved every few months. Third, to trace the key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency in the absence of specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost reductions afforded by common inference-time techniques such as majority voting and self-refinement, finding that their marginal accuracy gains rarely justify their costs. Our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.
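
As an illustration only, the two definitions above can be sketched in a few lines of Python. The sketch rests on an assumption the abstract does not state: attempts are independent with a fixed success probability, so the expected number of attempts until a correct solution is the reciprocal of that probability, and the expected cost is the per-attempt cost divided by it. The `Candidate` type, the function names, and every number below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A model (or human expert) that can attempt a problem. Hypothetical type."""
    name: str
    cost_per_attempt: float  # expected dollars spent on one attempt
    success_rate: float      # probability that a single attempt is correct


def cost_of_pass(c: Candidate) -> float:
    """Expected dollars per correct solution, assuming independent retries."""
    if c.success_rate <= 0.0:
        return float("inf")  # never correct, so no finite budget suffices
    return c.cost_per_attempt / c.success_rate


def frontier_cost_of_pass(candidates: list[Candidate]) -> float:
    """Minimum cost-of-pass over all candidates, human expert included."""
    return min(cost_of_pass(c) for c in candidates)


# Made-up numbers, for illustration only.
candidates = [
    Candidate("lightweight", cost_per_attempt=0.002, success_rate=0.40),
    Candidate("reasoning", cost_per_attempt=0.050, success_rate=0.90),
    Candidate("human_expert", cost_per_attempt=20.0, success_rate=1.0),
]

for c in candidates:
    print(f"{c.name}: {cost_of_pass(c):.4f}")
print(f"frontier: {frontier_cost_of_pass(candidates):.4f}")
```

With these made-up inputs the cheaper, less accurate candidate sets the frontier, which shows why a higher per-token price alone does not decide cost-effectiveness: the ranking depends on the ratio of cost to success rate for the task at hand.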
