

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

April 17, 2026
Authors: Ankit Maloo
cs.AI

Abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify the problem in a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism-design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring uses a three-tier rubric gated by a mandatory conjunctive check; the mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work: scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
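To make the scoring and routing claims concrete, here is a minimal Python sketch of the two mechanisms the abstract describes: a conjunctive gate over mandatory criteria followed by a three-tier quality rubric, and routed coverage as the union of per-model pass sets. The names (`Attempt`, `TIER_SCORES`, `routed_coverage`), the tier labels, and the weights are hypothetical illustrations, not the paper's grader; only the gate-then-rubric structure, the 223-task count, and the union-routing idea come from the abstract.

```python
from dataclasses import dataclass

N_TASKS = 223  # benchmark size, per the abstract
TIER_SCORES = {"partial": 0.5, "good": 0.75, "expert": 1.0}  # assumed tier labels/weights

@dataclass
class Attempt:
    mandatory_met: list[bool]  # one flag per mandatory criterion (the "predicted wrong paths")
    tier: str                  # rubric tier assigned by the grader

def passes(attempt: Attempt) -> bool:
    # Conjunctive gate: failing any single mandatory criterion fails the whole task.
    return all(attempt.mandatory_met)

def quality(attempt: Attempt) -> float:
    # Quality is only meaningful conditional on passing the gate.
    return TIER_SCORES[attempt.tier] if passes(attempt) else 0.0

def routed_coverage(pass_sets: dict[str, set[int]]) -> float:
    # Routing upper bound: a task is covered if any model in the pool passes it,
    # i.e. the union of per-model pass sets over the task IDs.
    covered: set[int] = set().union(*pass_sets.values())
    return len(covered) / N_TASKS
```

Under this reading, routing helps precisely because the pass sets barely overlap: with the top two models agreeing on only 31.7% of their passes and 44 tasks solved by exactly one of the top eight, the union across eight models (50.7%) nearly doubles the best single model's 27.9%.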