K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00% to 45.67% accuracy, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program reach only 0.00% to 10.33%. To exploit the asymmetry between solving and creating web browsing problems, we further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation. On this adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00% accuracy; we report this split separately as a targeted stress test. We publicly release our data and code.