K-BrowseComp: A Web-Browsing Agent Benchmark Grounded in Korean Contexts

Abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. Its 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00% to 45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00% to 10.33%. To exploit the asymmetry between solving and creating web-browsing problems, we further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation. On this adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%; we report this split separately as a targeted stress test. We publicly release our data and code.
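
As a quick sanity check on the reported figures, the sketch below converts between percentages and problem counts. It assumes each reported score is a single-run accuracy over the full split (300 problems for K-BrowseComp-Verified, 100 for the synthetic split), which the abstract does not state explicitly. The function names are illustrative and do not come from the released code.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return round(100 * correct / total, 2)


def implied_correct(percent: float, total: int) -> int:
    """Return the nearest whole number of solved problems for a percentage.

    Assumption: the percentage is a plain fraction of `total`, not an
    average over several runs.
    """
    return round(percent * total / 100)


# Split sizes taken from the abstract.
VERIFIED_SIZE = 300
SYNTHETIC_SIZE = 100

if __name__ == "__main__":
    for pct in (45.67, 30.00, 10.33, 0.00):
        n = implied_correct(pct, VERIFIED_SIZE)
        print(f"{pct:.2f}% of {VERIFIED_SIZE} is about {n} problems")
    n = implied_correct(26.00, SYNTHETIC_SIZE)
    print(f"26.00% of {SYNTHETIC_SIZE} is about {n} problems")
```

Under that assumption, every reported percentage maps to a whole number of problems and back again, so the two-decimal figures are consistent with the stated split sizes.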