AgentSearchBench: A Benchmark for AI Agent Search in the Wild
April 24, 2026
Authors: Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz
cs.AI
Abstract
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge: identifying a suitable agent for a given task. Unlike those of traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. Yet existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent-search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
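To make the retrieval-plus-reranking framing concrete, below is a minimal Python sketch of one way a reranker could blend description similarity with an execution-aware probe signal. Everything here is an illustrative assumption rather than the benchmark's actual implementation: the Agent structure, the probe_score field (imagined as a success rate over a few cheap probe executions), the mixing weight alpha, and the toy token-overlap similarity standing in for an embedding retriever.

    from dataclasses import dataclass

    @dataclass
    class Agent:
        name: str
        description: str
        probe_score: float  # hypothetical: success rate on a few cheap probe executions

    def text_similarity(query: str, description: str) -> float:
        """Toy token-overlap (Jaccard) similarity standing in for an embedding retriever."""
        q, d = set(query.lower().split()), set(description.lower().split())
        return len(q & d) / max(len(q | d), 1)

    def rerank(query: str, candidates: list[Agent], alpha: float = 0.5) -> list[Agent]:
        """Rank agents by a blend of description similarity and a probe signal.

        alpha is a hypothetical mixing weight; the paper's actual scoring
        function is not specified in the abstract.
        """
        def score(agent: Agent) -> float:
            return alpha * text_similarity(query, agent.description) + (1 - alpha) * agent.probe_score
        return sorted(candidates, key=score, reverse=True)

    if __name__ == "__main__":
        pool = [
            Agent("travel-planner", "plans multi-city trips with flight and hotel booking", 0.30),
            Agent("code-helper", "writes and debugs python code for data tasks", 0.85),
        ]
        for agent in rerank("debug a failing python script", pool):
            print(agent.name)

Even this crude blend illustrates the abstract's central observation: an agent whose description only partially matches the query can outrank a closer textual match once execution-grounded evidence is factored in.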