BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
April 27, 2025
Authors: Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua
cs.AI
Abstract
As large language models (LLMs) evolve into tool-using agents, the ability to
browse the web in real-time has become a critical yardstick for measuring their
reasoning and retrieval competence. Existing benchmarks such as BrowseComp
concentrate on English and overlook the linguistic, infrastructural, and
censorship-related complexities of other major information ecosystems -- most
notably Chinese. To address this gap, we introduce BrowseComp-ZH, a
high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents
on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning
11 diverse domains. Each question is reverse-engineered from a short,
objective, and easily verifiable answer (e.g., a date, number, or proper noun).
A two-stage quality control protocol is applied to ensure high question
difficulty and answer uniqueness. We benchmark over 20 state-of-the-art
language models and agentic search systems on our proposed BrowseComp-ZH.
Despite their strong conversational and retrieval capabilities, most models
struggle severely: a large number achieve accuracy rates below 10%, and only a
handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch,
reaches just 42.9%. These results demonstrate the considerable difficulty of
BrowseComp-ZH, where success demands not only effective retrieval strategies,
but also sophisticated reasoning and information reconciliation -- capabilities
that current models still struggle to master. Our dataset, construction
guidelines, and benchmark results have been publicly released at
https://github.com/PALIN2018/BrowseComp-ZH.