CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
November 4, 2025
Authors: Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang, Bingxiang He, Yi R. Fung
cs.AI
Abstract
Current evaluations of Large Language Model (LLM) agents primarily emphasize
task completion, often overlooking resource efficiency and adaptability. This
neglects a crucial capability: agents' ability to devise and adjust
cost-optimal plans in response to changing environments. To bridge this gap, we
introduce CostBench, a scalable, cost-centric benchmark designed to evaluate
agents' economic reasoning and replanning abilities. Situated in the
travel-planning domain, CostBench comprises tasks solvable via multiple
sequences of atomic and composite tools with diverse, customizable costs. It
also supports four types of dynamic blocking events, such as tool failures and
cost changes, to simulate real-world unpredictability and require agents to
adapt in real time. Evaluating leading open-source and proprietary models on
CostBench reveals a substantial gap in cost-aware planning: agents frequently
fail to identify cost-optimal solutions even in static settings, with GPT-5
achieving an exact-match rate below 75% on the hardest tasks, and performance
dropping by a further ~40% under dynamic conditions. By diagnosing these
weaknesses, CostBench lays the groundwork for developing future agents that are
both economically rational and robust.
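The core task the abstract describes — finding the cheapest sequence of tools that reaches a goal, then replanning when a tool is blocked — can be sketched as a min-cost search over tool applications. The tool names, states, and costs below are hypothetical illustrations, not CostBench's actual task set or API; this is only a minimal sketch of the planning problem the benchmark evaluates, assuming each tool deterministically maps one state to another at a fixed cost.

```python
import heapq

def cheapest_plan(tools, start, goal):
    """Dijkstra over tool applications: each tool is (name, src, dst, cost).
    Returns (total_cost, tool_sequence) for the cheapest start->goal plan."""
    frontier = [(0, start, [])]
    settled = {}  # state -> best cost found so far
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if settled.get(state, float("inf")) <= cost:
            continue
        settled[state] = cost
        for name, src, dst, c in tools:
            if src == state:
                heapq.heappush(frontier, (cost + c, dst, plan + [name]))
    return float("inf"), []

# Hypothetical travel tools: (name, from, to, cost).
tools = [
    ("train_A_B", "A", "B", 30),
    ("flight_A_C", "A", "C", 120),
    ("bus_B_C", "B", "C", 40),  # composite route A->B->C totals 70
]

print(cheapest_plan(tools, "A", "C"))  # (70, ['train_A_B', 'bus_B_C'])

# A blocking event: the bus fails, forcing a replan to the costlier flight.
blocked = [t for t in tools if t[0] != "bus_B_C"]
print(cheapest_plan(blocked, "A", "C"))  # (120, ['flight_A_C'])
```

An agent that only optimizes task completion would accept either plan; a cost-aware agent must find the 70-cost route first and fall back to the 120-cost one only when the blocking event occurs.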