K-BrowseComp: A Web-Browsing Agent Benchmark Grounded in Korean Contexts

Abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.