The Shadow Price of Reasoning: An Economic Perspective on Optimal Budget Allocation for Large Language Models

Abstract

Inference-time scaling has emerged as a critical avenue for improving the performance of large language models (LLMs), yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem governed by economic principles. By modeling per-query reasoning utility with a shifted-surge function, we derive an optimal allocation policy based on a global shadow price that equilibrates marginal utility across queries under resource scarcity. Building on this theory, we propose Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR). CLEAR performs rational abandonment: it withdraws resources from insolvent queries and reallocates them to solvable queries near their emergence thresholds. Extensive experiments across several reasoning tasks and traffic streams show that CLEAR significantly improves the Pareto frontier of total token cost versus mean accuracy. In resource-scarce regimes, CLEAR achieves up to a 3x improvement in global accuracy over uniform allocation.
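
To make the shadow-price idea concrete, the following is a minimal illustrative sketch and not the CLEAR implementation. The abstract does not give the form of the shifted-surge function, so the sketch assumes a hypothetical one: utility is zero up to a per-query threshold `tau` and then rises as `a * (1 - exp(-k * (b - tau)))`. The parameters `a`, `k`, `tau` and the function names are assumptions made for illustration. Given a price `lam` per token, each query either takes the budget at which its marginal utility equals `lam` or is abandoned when its net surplus is not positive; bisection then searches for a price at which total spend fits the global budget. Because the assumed utility is not concave, this price-based rule is a heuristic here rather than a proven optimum.

```python
import math


def utility(b, a, k, tau):
    """Assumed shifted-surge utility: zero up to tau, then saturating toward a."""
    if b <= tau:
        return 0.0
    return a * (1.0 - math.exp(-k * (b - tau)))


def best_response(lam, a, k, tau):
    """Budget one query takes at price lam; 0.0 means the query is abandoned."""
    if a * k <= lam:
        # Marginal utility never reaches the price, so spend nothing.
        return 0.0
    # Solve a * k * exp(-k * (b - tau)) = lam for b.
    b = tau + math.log(a * k / lam) / k
    # Rational abandonment: skip queries whose net surplus is not positive.
    if utility(b, a, k, tau) - lam * b <= 0.0:
        return 0.0
    return b


def allocate(queries, budget, iters=200):
    """Bisect (geometrically) on the price until total spend fits the budget."""
    lo = 1e-12
    hi = max(a * k for a, k, tau in queries)  # at this price every query spends 0
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        spend = sum(best_response(mid, *q) for q in queries)
        if spend > budget:
            lo = mid
        else:
            hi = mid
    return hi, [best_response(hi, *q) for q in queries]


# Hypothetical (a, k, tau) triples: an easy, a medium, and an insolvent query.
queries = [(0.9, 0.05, 10.0), (0.8, 0.02, 100.0), (0.7, 0.01, 1000.0)]
lam, alloc = allocate(queries, budget=300.0)
priced = sum(utility(b, *q) for b, q in zip(alloc, queries))
uniform = sum(utility(100.0, *q) for q in queries)
print(f"price={lam:.5f}", [round(b, 1) for b in alloc])
print(f"priced utility={priced:.3f}, uniform utility={uniform:.3f}")
```

In this toy setting the third query needs more than the entire budget just to cross its threshold, so it is abandoned and its share goes to the second query, which a uniform split of 100 tokens each would leave stranded exactly at its threshold.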