K-BrowseComp: A Web-Browsing Agent Benchmark Grounded in Korean Contexts

Abstract

Frontier model evaluation is shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, yet Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a 400-problem web-browsing agent benchmark grounded in Korean contexts. Its 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean-speaking researchers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00% to 45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00% to 10.33%. To exploit the asymmetry between solving and creating web-browsing problems, we further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation. On this adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%; we report the split separately as a targeted stress test. We publicly release our data and code.