

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

April 27, 2025
作者: Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua
cs.AI

Abstract
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
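Since every BrowseComp-ZH question is reverse-engineered from a short, objective, easily verifiable answer (a date, number, or proper noun), accuracy can be scored by string comparison against the gold answer. The sketch below shows a minimal exact-match scorer of that kind; the normalization rules and function names are illustrative assumptions, not the benchmark's official evaluation protocol (the repository linked above documents the actual procedure).

```python
# Minimal sketch of an exact-match accuracy scorer for short verifiable
# answers, as used in BrowseComp-style evaluations. Normalization choices
# (NFKC, whitespace stripping, lowercasing) are assumptions for illustration.
import unicodedata


def normalize(text: str) -> str:
    """Canonicalize Unicode width/compatibility forms, strip whitespace, lowercase."""
    return unicodedata.normalize("NFKC", text).strip().lower()


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose normalized form equals the gold answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one-to-one")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Normalization matters for the Chinese web in particular, where full-width digits and punctuation (e.g., "２０２４") frequently appear alongside their ASCII equivalents; NFKC folding treats both forms as the same answer.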
