MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
October 22, 2025
Authors: Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai
cs.AI
Abstract
We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop,
end-to-end tool orchestration by LLM agents in a hierarchical Model-Context
Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in
isolation, ignoring challenges such as functional overlap and cross-server
orchestration, leading to overly optimistic assessments. MSC-Bench addresses
these gaps by constructing ground truth through 'equal function sets', which
enables objective metrics such as F1 score and reduces reliance on
LLM-as-a-judge evaluation. Organized as a five-level curriculum, it
systematically tests agent capabilities from single-tool orchestration to
complex cross-server planning, and robustness to out-of-scope requests.
Experiments reveal that rigid hierarchies can hinder performance without
co-designed strategies, and even state-of-the-art agents exhibit systemic
weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose
these limitations and guide the development of more capable and efficient
tool-using agents. The benchmark and resources are publicly available at
https://github.com/snooow1029/MSC_Bench.
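To make the evaluation idea concrete, the minimal Python sketch below scores a predicted set of tool calls against ground truth expressed as equivalence sets, where any functionally equivalent tool satisfies a gold slot. The function name, data shapes, and example tool identifiers are hypothetical illustrations under that assumption, not the benchmark's actual scoring code.

```python
# Minimal sketch (hypothetical names): F1 for a predicted tool set scored
# against ground truth given as "equal function sets" -- each gold slot is
# satisfied by any one of its functionally equivalent tools.

def f1_against_equal_sets(predicted: set[str], gold_sets: list[set[str]]) -> float:
    remaining = list(gold_sets)
    true_positives = 0
    for tool in predicted:
        for i, equivalent_tools in enumerate(remaining):
            if tool in equivalent_tools:
                true_positives += 1
                remaining.pop(i)  # each gold slot can be matched only once
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold_sets) if gold_sets else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one prediction matches an equivalence set, the other is spurious.
print(f1_against_equal_sets(
    {"weather.get_forecast", "maps.route"},
    [{"weather.get_forecast", "weather.forecast_v2"}, {"calendar.create_event"}],
))  # 0.5
```

Matching against any member of an equivalence set, rather than a single canonical tool, avoids penalizing an agent that selects a functionally overlapping tool hosted on a different server.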