

MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration

October 22, 2025
Authors: Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai
cs.AI

Abstract

We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents in a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in isolation, ignoring challenges such as functional overlap and cross-server orchestration, leading to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score and reducing the dependency on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents. The benchmark and resources are publicly available at https://github.com/snooow1029/MSC_Bench.
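The abstract's evaluation idea is a set-based F1 score computed against "equal function set" ground truth, where several functionally interchangeable tools are all accepted as correct. The snippet below is a minimal, hypothetical sketch of that scoring logic, not MSC-Bench's released code: it assumes a predicted tool call counts as a true positive when it names any member of a ground-truth equivalence set, that each set can be matched at most once, and the tool names are invented for illustration.

```python
# Hypothetical sketch of set-based F1 scoring against "equal function set" ground truth.
# Assumption: a predicted tool is correct if it belongs to any ground-truth equivalence
# set, and each equivalence set may be satisfied at most once.
from typing import List, Set


def f1_against_equal_sets(predicted: List[str], gold_sets: List[Set[str]]) -> float:
    """F1 between predicted tool calls and ground-truth equivalence sets."""
    unmatched = list(gold_sets)
    true_positives = 0
    for tool in predicted:
        for i, eq_set in enumerate(unmatched):
            if tool in eq_set:
                true_positives += 1
                unmatched.pop(i)  # each requirement counts only once
                break
    if not predicted or not gold_sets:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold_sets)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: one valid pick from the weather equivalence set plus one irrelevant tool.
gold = [{"weather.get_forecast", "meteo.forecast"}]  # interchangeable options
pred = ["meteo.forecast", "calendar.create_event"]   # one hit, one miss
print(round(f1_against_equal_sets(pred, gold), 3))   # 0.667
```

In this toy case the agent's precision is 0.5 and recall is 1.0, giving F1 ≈ 0.667; because correctness is decided by set membership rather than by an LLM judge, the metric stays objective and reproducible.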