The Shadow Price of Reasoning: An Economic Perspective on Optimal Budget Allocation for LLMs

Abstract

Inference-time scaling has emerged as a critical avenue for improving the performance of large language models (LLMs), yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem grounded in economic principles. By modeling per-query reasoning utility with a shifted-surge function, we derive an optimal allocation policy based on a global shadow price that equilibrates marginal utility across queries under resource scarcity. Building on this theory, we propose Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR), which performs rational abandonment: it reallocates resources from insolvent queries to solvable queries near their emergence thresholds. Extensive experiments on several reasoning tasks under different traffic streams show that CLEAR significantly improves the Pareto frontier of total token cost versus mean accuracy. In resource-scarce regimes, CLEAR achieves up to a 3x improvement in global accuracy over uniform allocation.
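To make the shadow-price idea concrete, the following is a minimal sketch and not the CLEAR algorithm itself. The abstract does not give the form of the shifted-surge function, so the code assumes a logistic curve shifted to a per-query threshold `tau` as a stand-in. The names `utility`, `best_response`, and `allocate`, the steepness `s`, and the discrete budget grid are all hypothetical choices made for illustration. Each query picks the budget that maximizes its utility gain minus price times tokens, and a bisection search finds the smallest price at which total spend fits the global budget. A query whose best net surplus is not positive receives zero tokens, which mirrors the rational abandonment described above.

```python
import numpy as np

def utility(b, tau, s=50.0):
    # ASSUMPTION: a logistic curve shifted to threshold tau stands in for
    # the paper's shifted-surge function, whose form is not given here.
    return 1.0 / (1.0 + np.exp(-(b - tau) / s))

def best_response(lam, taus, grid):
    # For each query, choose the grid budget maximizing (gain - lam * b).
    # grid[0] must be 0 so that "abandon" (net surplus 0) is always an option.
    gain = utility(grid[None, :], taus[:, None]) - utility(0.0, taus[:, None])
    net = gain - lam * grid[None, :]
    return grid[np.argmax(net, axis=1)]  # ties resolve to the smaller budget

def allocate(taus, total_budget, grid, iters=60):
    # Bisection on the price lam: total spend is non-increasing in lam.
    taus = np.asarray(taus, dtype=float)
    lo, hi = 0.0, 1.0 / grid[1]  # at hi, every query abandons (gain < 1)
    alloc = best_response(lo, taus, grid)
    if alloc.sum() <= total_budget:
        return alloc, lo  # budget is not binding, so the price is zero
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if best_response(mid, taus, grid).sum() > total_budget:
            lo = mid  # too cheap: spend exceeds the budget
        else:
            hi = mid  # feasible: try a lower price
    return best_response(hi, taus, grid), hi

if __name__ == "__main__":
    # Hypothetical thresholds: two easy queries and one very hard query.
    grid = np.arange(0, 4001, 50, dtype=float)
    alloc, lam = allocate([100.0, 300.0, 2000.0], 1200.0, grid)
    print("allocation:", alloc, "shadow price:", lam)
```

In this toy setting a uniform split would give each query 400 tokens, far below the hard query's threshold, so those tokens would buy almost nothing. The price-based rule instead withholds tokens from the hard query and lets the two solvable queries take what they need. The grid search is used because an S-shaped utility is not concave, so a first-order condition alone does not identify the best budget.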