Cost-of-Pass: An Economic Framework for Evaluating Language Models
April 17, 2025
Authors: Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou
cs.AI
Abstract
The widespread adoption of AI systems in the economy hinges on their ability
to generate economic value that outweighs their inference costs. Evaluating
this tradeoff requires metrics that account for both performance and costs. We
propose a framework grounded in production theory for evaluating language
models by combining accuracy and inference cost. We introduce "cost-of-pass",
the expected monetary cost of generating a correct solution. We then define the
"frontier cost-of-pass" as the minimum cost-of-pass achievable across available
models or the "human-expert", using the approximate cost of hiring an expert.
Our analysis reveals distinct economic insights. First, lightweight models are
most cost-effective for basic quantitative tasks, large models for
knowledge-intensive ones, and reasoning models for complex quantitative
problems, despite higher per-token costs. Second, tracking this frontier
cost-of-pass over the past year reveals significant progress, particularly for
complex quantitative tasks where the cost has roughly halved every few months.
Third, to trace key innovations driving this progress, we examine
counterfactual frontiers: estimates of cost-efficiency without specific model
classes. We find that innovations in lightweight, large, and reasoning models
have been essential for pushing the frontier in basic quantitative,
knowledge-intensive, and complex quantitative tasks, respectively. Finally, we
assess the cost-reductions afforded by common inference-time techniques like
majority voting and self-refinement, finding that their marginal accuracy gains
rarely justify their costs. Our findings underscore that complementary
model-level innovations are the primary drivers of cost-efficiency, and our
economic framework provides a principled tool for measuring this progress and
guiding deployment.
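To make the metrics described in the abstract concrete, below is a minimal sketch, not the authors' implementation, of how cost-of-pass, the frontier cost-of-pass, a counterfactual frontier, and the cost of a k-sample majority-voting variant could be computed. The ModelRun fields, the per-attempt prices, and the human-expert cost are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    """Aggregate statistics for one model on one task (illustrative fields)."""
    name: str
    accuracy: float               # fraction of problems solved correctly (pass rate)
    avg_cost_per_attempt: float   # expected dollar cost of one inference attempt


def cost_of_pass(run: ModelRun) -> float:
    """Expected monetary cost of one correct solution: attempt cost / pass rate."""
    if run.accuracy == 0.0:
        return float("inf")       # a model that never succeeds has unbounded cost-of-pass
    return run.avg_cost_per_attempt / run.accuracy


def frontier_cost_of_pass(runs, human_expert_cost: float) -> float:
    """Minimum cost-of-pass over all available models and the human-expert baseline."""
    return min([cost_of_pass(r) for r in runs] + [human_expert_cost])


def counterfactual_frontier(runs, excluded, human_expert_cost: float) -> float:
    """Frontier recomputed as if the excluded model class did not exist."""
    kept = [r for r in runs if r.name not in excluded]
    return frontier_cost_of_pass(kept, human_expert_cost)


def cost_of_pass_majority_vote(run: ModelRun, k: int, boosted_accuracy: float) -> float:
    """Cost-of-pass under k-sample majority voting: k attempts are paid for,
    divided by the (typically modestly) improved accuracy."""
    return (k * run.avg_cost_per_attempt) / boosted_accuracy


# Hypothetical numbers, for illustration only.
runs = [
    ModelRun("lightweight-model", accuracy=0.60, avg_cost_per_attempt=0.002),
    ModelRun("large-model",       accuracy=0.85, avg_cost_per_attempt=0.040),
    ModelRun("reasoning-model",   accuracy=0.90, avg_cost_per_attempt=0.150),
]

print(frontier_cost_of_pass(runs, human_expert_cost=50.0))            # ~0.0033 per correct answer
print(counterfactual_frontier(runs, {"lightweight-model"}, 50.0))     # ~0.047 without lightweight models
# Majority voting over 5 samples of the large model: 5x the attempt cost for a modest accuracy change.
print(cost_of_pass_majority_vote(runs[1], k=5, boosted_accuracy=0.88))
```

Dividing the expected attempt cost by the pass rate is what allows a cheap-but-weak model and an expensive-but-strong one to be compared on a single dollars-per-correct-solution axis, and the last line illustrates why multiplying the attempt cost by k rarely pays off unless the accuracy gain is substantial.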