GISA: A Benchmark for General Information-Seeking Assistant
February 9, 2026
Authors: Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, Zhicheng Dou
cs.AI
Abstract
The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves an exact match score of only 19.30%, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.