InteractComp: Evaluating Search Agents With Ambiguous Queries

October 28, 2025
Authors: Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo
cs.AI

Abstract

Language agents have demonstrated remarkable potential in web search and information retrieval. However, existing search agents assume that user queries are complete and unambiguous, an assumption that diverges from reality: users often begin with incomplete queries that require clarification through interaction. Most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of "easy to verify, interact to disambiguate", we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals a striking failure: the best model achieves only 13.73% accuracy, compared with 71.50% when given complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability that current strategies fail to engage. Longitudinal analysis shows that interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.
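
The target-distractor idea in the abstract can be illustrated with a toy sketch. This is not the InteractComp implementation or its data format: the `Entity` class, the `candidates` helper, and every attribute string below are hypothetical and exist only to show why a query built from shared attributes cannot be resolved without a clarifying answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical record: a name plus the set of facts known about it."""
    name: str
    attributes: frozenset


def candidates(query_attrs, entities):
    """Return names of entities consistent with every attribute in the query."""
    return [e.name for e in entities if query_attrs <= e.attributes]


# Toy data (invented for illustration): the target and the distractor share
# two attributes and differ in one.
target = Entity("target", frozenset({"team sport", "uses a ball", "played on ice"}))
distractor = Entity("distractor", frozenset({"team sport", "uses a ball", "played on grass"}))
pool = [target, distractor]

# The ambiguous query mentions only the shared attributes, so search alone
# cannot tell the two entities apart.
ambiguous_query = frozenset({"team sport", "uses a ball"})
before = candidates(ambiguous_query, pool)

# One clarifying answer from the user supplies the distinguishing attribute.
clarified_query = ambiguous_query | {"played on ice"}
after = candidates(clarified_query, pool)

print(before)
print(after)
```

In this sketch an agent that answers immediately has to guess between two equally consistent candidates, while a single clarifying exchange leaves only the target. That mirrors, in miniature, the gap the abstract reports between answering with and without complete context.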