Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions
February 16, 2026
Authors: Mohammed Mehedi Hasan, Hao Li, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
cs.AI
Abstract
The Model Context Protocol (MCP) introduces a standard specification that defines how Foundation Model (FM)-based agents should interact with external systems by invoking tools. However, to understand a tool's purpose and features, FMs rely on natural-language tool descriptions, making these descriptions a critical component in guiding FMs to select the optimal tool for a given (sub)task and to pass the right arguments to the tool. While defects or smells in these descriptions can misguide FM-based agents, their prevalence and consequences in the MCP ecosystem remain unclear.
Hence, we empirically examine 856 tools spread across 103 MCP servers, assessing their description quality and its impact on agent performance. We identify six components of tool descriptions from the literature, develop a scoring rubric based on these components, and then formalize tool description smells in terms of this rubric. By operationalizing the rubric through an FM-based scanner, we find that 97.1% of the analyzed tool descriptions contain at least one smell, with 56% failing to state their purpose clearly. While augmenting these descriptions to cover all components improves task success rates by a median of 5.85 percentage points and improves partial goal completion by 15.12%, it also increases the number of execution steps by 67.46% and regresses performance in 16.67% of cases. These results indicate that achieving performance gains is not straightforward: the added execution cost must be traded off against the gains, and the execution context also affects the outcome. Furthermore, component ablations show that compact variants built from different component combinations often preserve behavioral reliability while reducing unnecessary token overhead, enabling more efficient use of the FM context window and lower execution costs.
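To make the rubric-to-smell step concrete, the following is a minimal sketch of how per-component rubric scores might be turned into smell flags and a prevalence figure. It is not the authors' scanner. The abstract names only the purpose component, so the other five component names, the 1-5 scale, and the threshold are assumptions made for illustration. In the study the scores come from an FM-based scanner; here they are simply passed in.

```python
from typing import Dict, List

# Illustrative component names. Only "purpose" is named in the abstract;
# the other five are assumptions, not the paper's actual rubric.
COMPONENTS = (
    "purpose",
    "usage_guidelines",
    "limitations",
    "parameters",
    "examples",
    "return_value",
)

# Assumed cutoff on an assumed 1-5 scoring scale.
SMELL_THRESHOLD = 3


def find_smells(scores: Dict[str, int]) -> List[str]:
    """Return the components whose score falls below the threshold.

    A component missing from `scores` is treated as absent (score 0).
    """
    return [c for c in COMPONENTS if scores.get(c, 0) < SMELL_THRESHOLD]


def smell_prevalence(all_scores: List[Dict[str, int]]) -> float:
    """Fraction of tool descriptions that have at least one smell."""
    if not all_scores:
        return 0.0
    smelly = sum(1 for scores in all_scores if find_smells(scores))
    return smelly / len(all_scores)


if __name__ == "__main__":
    tools = [
        {c: 5 for c in COMPONENTS},       # fully documented tool
        {"purpose": 2, "parameters": 4},  # vague purpose, most parts missing
    ]
    for scores in tools:
        print(find_smells(scores))
    print(smell_prevalence(tools))
```

Under these assumptions the first tool has no smells and the second is flagged on every component except parameters. The same per-component flags could in principle drive the selective augmentation that the ablation results point to, repairing only the components that are flagged instead of expanding every description.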