K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

June 1, 2026
Authors: Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim
cs.AI

Abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00% to 45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00% to 10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.