The Shadow Price of Reasoning: An Economic Perspective on Optimal Budget Allocation for Large Language Models

Abstract

Inference-time scaling has emerged as a critical avenue for improving the performance of large language models, yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem governed by economic principles. By modeling per-query reasoning utility with a shifted-surge function, we derive an optimal allocation policy based on a global shadow price that equalizes marginal utility across queries under resource scarcity. Building on this theory, we propose Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR), which performs rational abandonment: it reallocates resources from insolvent queries to solvable queries near their emergence thresholds. Extensive experiments on several reasoning tasks with different traffic streams demonstrate that CLEAR significantly improves the Pareto frontier of total token cost versus mean accuracy. In resource-scarce regimes, CLEAR achieves up to a 3× improvement in global accuracy over uniform allocation.
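
To make the shadow-price idea concrete, the following is a minimal sketch and not the CLEAR implementation. It assumes a hypothetical logistic curve as a stand-in for the shifted-surge utility, whose exact form is not given here, and the names `utility`, `best_response`, `allocate`, `threshold`, and `scale` are all illustrative. Each query picks the budget that maximizes utility minus price times tokens, and abandons when no budget yields positive surplus. A bisection then searches for the lowest global price at which total spend fits the budget.

```python
import math

def utility(b, threshold, scale=50.0):
    """Hypothetical stand-in for a shifted-surge utility (an assumption, not
    the paper's function): a logistic curve shifted so that utility(0) == 0,
    rising sharply once the budget b approaches `threshold`."""
    base = 1.0 / (1.0 + math.exp(threshold / scale))
    sig = 1.0 / (1.0 + math.exp(-(b - threshold) / scale))
    return (sig - base) / (1.0 - base)

def best_response(threshold, price, grid):
    """Budget on `grid` that maximizes utility - price * budget.
    Returns 0 (abandon the query) when no budget has positive surplus."""
    best_b, best_surplus = 0, 0.0
    for b in grid:
        surplus = utility(b, threshold) - price * b
        if surplus > best_surplus:
            best_b, best_surplus = b, surplus
    return best_b

def allocate(thresholds, total_budget, grid, iters=60):
    """Bisect on the global price. `hi` always stays feasible: at price 1.0
    every query abandons, because utility < 1 and grid budgets are >= 1."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        spend = sum(best_response(t, mid, grid) for t in thresholds)
        if spend > total_budget:
            lo = mid  # too cheap: queries demand more than the budget
        else:
            hi = mid  # feasible: try a lower price
    return hi, [best_response(t, hi, grid) for t in thresholds]

# Illustrative values: two easy queries and one hard one, 900 tokens in total.
price, alloc = allocate([100, 200, 1500], 900, range(10, 2001, 10))
print(price, alloc)
```

In this toy setting the hard query (threshold 1500) cannot be solved within the budget, so it receives zero tokens, while the two easy queries are funded past their thresholds. Because the utility is not concave, demand jumps when a query becomes solvent, so the sketch may leave part of the budget unspent.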