KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM recognize the problem a professional scenario poses before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring uses a three-tier rubric gated by a mandatory conjunctive check: a response must satisfy every mandatory criterion to pass, and the mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes only 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8 models, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
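The scoring gate and the cross-model statistics mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the tier weights, the reading of "agreement" as intersection over union of two models' pass sets, and the reading of "routing" as the union of pass sets (an oracle router). The toy pass sets are invented and are not benchmark results.

```python
from collections import Counter


def gated_score(mandatory, tiers, weights=(0.5, 0.3, 0.2)):
    """Score one response under a gated three-tier rubric.

    mandatory: list of bools, one per mandatory criterion (conjunctive gate).
    tiers: three lists of bools, one list per rubric tier.
    weights: hypothetical tier weights, not taken from the paper.
    """
    if not all(mandatory):
        return 0.0  # failing any mandatory criterion fails the task outright
    score = 0.0
    for weight, tier in zip(weights, tiers):
        if tier:  # skip empty tiers to avoid dividing by zero
            score += weight * sum(tier) / len(tier)
    return score


def routing_coverage(passes, n_tasks):
    """Fraction of tasks passed by at least one model (oracle routing)."""
    return len(set().union(*passes.values())) / n_tasks


def uniquely_solved(passes):
    """Tasks passed by exactly one model."""
    counts = Counter(task for solved in passes.values() for task in solved)
    return {task for task, n in counts.items() if n == 1}


def pass_agreement(a, b):
    """Assumed agreement measure: shared passes over all passes of either model."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy data (hypothetical model and task names).
toy_passes = {
    "model_a": {"t1", "t2", "t3"},
    "model_b": {"t2", "t4"},
    "model_c": {"t2", "t3"},
}
coverage = routing_coverage(toy_passes, n_tasks=8)
unique = uniquely_solved(toy_passes)
agreement = pass_agreement(toy_passes["model_a"], toy_passes["model_b"])
```

Under these assumptions, low pairwise agreement and a large set of uniquely solved tasks are what make the routed coverage exceed any single model's pass rate.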
