
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

June 1, 2026
Authors: Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim
cs.AI

Abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by researchers who are native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00% to 45.67%, a substantial drop from their BrowseComp results, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00% to 10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation, exploiting the asymmetry between solving and creating web browsing problems. On this adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%; we report the split separately as a targeted stress test. We publicly release our data and code.