Ko-WideSearch: A Korean Wide-Search Benchmark for Exhaustive Set Enumeration by Web Agents

Abstract

Web-agent benchmarks overwhelmingly measure depth, that is, pinning down one obscure answer behind a chain of constraints. Breadth, meaning exhaustively enumerating a closed set and filling in each item's attributes, is barely evaluated, especially outside English. Breadth benchmarks are also hard to build: certifying that a gold set is complete and that every cell is correct is far costlier than checking a single answer. I introduce Ko-WideSearch, a Korean breadth-search benchmark built by an automated synthesize-and-verify pipeline. Each task names a set-parent entity (a TV season, a dynasty, a league, an administrative region, or an election) and asks for its full membership plus a per-item attribute table, graded by Item-F1, Column-F1, and Row-F1. The benchmark spans 228 tables over 190 entities and sixteen categories, organized into three difficulty tiers. Difficulty is set by two structural knobs that I dial independently, table width and a 2-D composite key, so that cross-product membership climbs from 0% to 100% across the tiers. A single normalization-aware comparator is shared between gold construction and grading, so stable date and count columns are not over-dropped on formatting alone. Across twenty web agents, the failure is consistent: agents recover the set but not the rows (e.g., Item-F1 92.8 against Row-F1 53.7), accuracy falls steadily as the knobs harden, and neither more search nor more spend closes the gap. Broken down by cell, the hard part is finding the right value, not formatting it: open-ended free-text cells fail most often, while cells with a standard answer, such as a date or a name, usually come out right.
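
The gap between set-level and row-level scores can be illustrated with a minimal sketch. The abstract does not give the metric definitions, so everything below is an assumption: Item-F1 is taken as F1 over normalized row keys, Row-F1 as F1 over rows whose every cell matches after normalization, and `normalize`, `f1`, and `score` are hypothetical names rather than the benchmark's actual comparator. Column-F1 is omitted for brevity.

```python
import re


def normalize(value) -> str:
    """Hypothetical normalizer (an assumption, not the benchmark's comparator)."""
    text = str(value).strip().casefold()
    # Dates such as "2020.3.1", "2020-03-01", or "2020년 3월 1일" -> "2020-03-01".
    m = re.fullmatch(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})\D*", text)
    if m:
        year, month, day = m.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    # Counts with thousands separators such as "1,234" -> "1234".
    if re.fullmatch(r"\d{1,3}(,\d{3})+", text):
        return text.replace(",", "")
    # Everything else: collapse internal whitespace only.
    return re.sub(r"\s+", " ", text)


def f1(true_positives: int, n_pred: int, n_gold: int) -> float:
    """Standard F1 from a hit count and the sizes of the two sets."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / n_pred
    recall = true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


def score(pred_rows, gold_rows, key, columns):
    """Assumed definitions: items match on the key, rows match on every cell."""
    gold = {normalize(r[key]): r for r in gold_rows}
    pred = {normalize(r[key]): r for r in pred_rows}
    shared = gold.keys() & pred.keys()
    row_hits = sum(
        all(normalize(pred[k].get(c, "")) == normalize(gold[k][c]) for c in columns)
        for k in shared
    )
    return {
        "item_f1": f1(len(shared), len(pred), len(gold)),
        "row_f1": f1(row_hits, len(pred), len(gold)),
    }


# Toy data (invented for illustration, not taken from the benchmark).
gold = [
    {"name": "A", "debut": "2020-03-01", "votes": "1234"},
    {"name": "B", "debut": "2021-07-15", "votes": "980"},
]
pred = [
    {"name": "A", "debut": "2020년 3월 1일", "votes": "1,234"},  # formatting differs only
    {"name": "B", "debut": "2021-07-16", "votes": "980"},        # wrong value
]
result = score(pred, gold, key="name", columns=["debut", "votes"])
print(result)
```

In this toy case both items are recovered, so the item-level score is perfect. Row A survives because its date and count differ from the gold only in formatting, while row B fails on a genuinely wrong value. This mirrors the pattern described above, where finding the right value is the hard part and formatting is not.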