CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
November 4, 2025
Authors: Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang, Bingxiang He, Yi R. Fung
cs.AI
Abstract
Current evaluations of Large Language Model (LLM) agents primarily emphasize
task completion, often overlooking resource efficiency and adaptability. This
neglects a crucial capability: agents' ability to devise and adjust
cost-optimal plans in response to changing environments. To bridge this gap, we
introduce CostBench, a scalable, cost-centric benchmark designed to evaluate
agents' economic reasoning and replanning abilities. Situated in the
travel-planning domain, CostBench comprises tasks solvable via multiple
sequences of atomic and composite tools with diverse, customizable costs. It
also supports four types of dynamic blocking events, such as tool failures and
cost changes, to simulate real-world unpredictability and require agents to
adapt in real time. Evaluating leading open-source and proprietary models on
CostBench reveals a substantial gap in cost-aware planning: agents frequently
fail to identify cost-optimal solutions in static settings, with even GPT-5
achieving less than 75% exact match rate on the hardest tasks, and performance
further dropping by around 40% under dynamic conditions. By diagnosing these
weaknesses, CostBench lays the groundwork for developing future agents that are
both economically rational and robust.
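The core task the abstract describes, finding a cost-optimal sequence of tools and replanning when a blocking event invalidates it, can be illustrated with a minimal sketch. This is not CostBench's actual implementation; the tool graph, tool names, and costs below are hypothetical, and uniform-cost (Dijkstra) search stands in for whatever planner an agent might use:

```python
import heapq

def cheapest_plan(graph, start, goal):
    """Uniform-cost search over a tool graph.

    graph[state] is a list of (tool_name, next_state, cost) edges,
    where a tool may be atomic (one hop) or composite (skips states).
    Returns (total_cost, [tool_name, ...]) or None if the goal is blocked.
    """
    frontier = [(0, start, [])]
    seen = set()
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if state in seen:
            continue
        seen.add(state)
        for tool, nxt, c in graph.get(state, []):
            if nxt not in seen:
                heapq.heappush(frontier, (cost + c, nxt, plan + [tool]))
    return None

# Hypothetical travel-planning instance: atomic tools (flight, train)
# plus a composite "package" tool reaching the goal in one step.
graph = {
    "home": [("flight", "city", 300), ("package", "city", 250), ("train", "hub", 80)],
    "hub":  [("train", "city", 120)],
    "city": [],
}
print(cheapest_plan(graph, "home", "city"))  # (200, ['train', 'train'])

# Dynamic blocking event: the hub-to-city train fails, so the
# previously optimal plan breaks and the agent must replan.
graph["hub"] = []
print(cheapest_plan(graph, "home", "city"))  # (250, ['package'])
```

The benchmark's point is that an LLM agent must reach these optima through multi-turn tool calls rather than explicit graph search, and must notice mid-execution when a blocking event has changed the optimum.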