AgentSearchBench: A Benchmark for AI Agent Search in the Wild
April 24, 2026
Authors: Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz
cs.AI
Abstract
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
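To make the abstract's two-stage formulation concrete, here is a minimal, self-contained sketch of a retrieve-then-rerank pipeline in which reranking mixes description similarity with an execution-aware probe signal. Everything in it is an illustrative assumption rather than AgentSearchBench's actual implementation: the bag-of-words "embedding", the `probe` protocol, the mixing weight `alpha`, and the toy agents are all hypothetical stand-ins.

```python
# Hypothetical sketch of agent search as retrieval + reranking.
# Stage 1 ranks agents by description similarity alone; stage 2 adds a
# cheap execution-aware probe score, mirroring the behavioral signals the
# paper reports as improving ranking quality. All names and weights here
# are assumptions for illustration, not the benchmark's code.
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    agent_id: str
    description: str

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a dense encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, agents: list[Agent], k: int) -> list[Agent]:
    """Stage 1: description-based retrieval by semantic similarity."""
    q = embed(query)
    ranked = sorted(agents,
                    key=lambda a: cosine(q, embed(a.description)),
                    reverse=True)
    return ranked[:k]

def rerank(query: str, candidates: list[Agent],
           probe: Callable[[Agent, str], float],
           alpha: float = 0.5) -> list[Agent]:
    """Stage 2: mix description similarity with an execution-aware probe.

    `probe` runs the agent on a cheap surrogate of the query and returns a
    success score in [0, 1]; `alpha` trades off the two signals (both the
    probe protocol and the weight are illustrative assumptions).
    """
    q = embed(query)

    def score(agent: Agent) -> float:
        sim = cosine(q, embed(agent.description))
        return alpha * sim + (1 - alpha) * probe(agent, query)

    return sorted(candidates, key=score, reverse=True)

if __name__ == "__main__":
    agents = [
        Agent("a1", "books flights and hotels for travel planning"),
        Agent("a2", "summarizes research papers and drafts reviews"),
        Agent("a3", "plans trips, compares fares, reserves tickets"),
    ]
    # Stub probe: pretend a3 executes travel tasks far more reliably than
    # a1, even though both descriptions look relevant to the query.
    fake_success = {"a1": 0.4, "a2": 0.1, "a3": 0.9}
    probe = lambda agent, query: fake_success[agent.agent_id]

    query = "book a flight for a week-long travel itinerary"
    top = retrieve(query, agents, k=3)
    print([a.agent_id for a in rerank(query, top, probe)])
```

The stub probe is the point of the sketch: two agents can be near-indistinguishable by description similarity yet differ sharply in execution success, which is the semantic-similarity-versus-performance gap the benchmark is designed to expose.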