InteractComp: Evaluating Search Agents With Ambiguous Queries

October 28, 2025
Authors: Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo
cs.AI

Abstract

Language agents have demonstrated remarkable potential in web search and information retrieval. However, existing search agents generally assume that user queries are complete and unambiguous, an assumption that diverges from reality, where users often begin with incomplete queries that require clarification through interaction. Most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of "easy to verify, interact to disambiguate", we construct 210 expert-curated questions across 9 domains using a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals a striking failure: the best model achieves only 13.73% accuracy in the interactive setting, compared with 71.50% when given complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability that current strategies fail to engage. Longitudinal analysis shows that interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.