GISA: A Benchmark for General Information-Seeking Assistant
February 9, 2026
Authors: Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, Zhicheng Dou
cs.AI
Abstract
The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves an exact match score of only 19.30%, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.