

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

April 17, 2026
Author: Ankit Maloo
cs.AI

Abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM recognize the problem in a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring uses a three-tier rubric gated by a mandatory conjunctive check, whose criteria encode the predicted wrong paths. We evaluate 16 models; the best passes only 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
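The two evaluation mechanics the abstract describes, a three-tier quality rubric sitting behind a conjunctive mandatory gate, and routing coverage computed as the union of per-model pass sets, can be sketched in a few lines. This is a minimal illustration only: all function names, criteria strings, and task IDs below are assumptions for the sketch, not KWBench's actual schema or data.

```python
# Sketch of (1) conjunctive-gated tiered scoring and (2) routing coverage.
# All identifiers, criteria, and task data are illustrative assumptions.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Rubric:
    mandatory: list   # conjunctive criteria: every one must be met to pass
    tiers: dict       # tier name -> quality score

def score_response(met, tier, rubric):
    # Conjunctive gate: missing any mandatory criterion fails the task
    # outright, regardless of the quality tier otherwise earned.
    if not all(c in met for c in rubric.mandatory):
        return 0
    return rubric.tiers[tier]

def routing_coverage(pass_sets):
    # Tasks passed by at least one model in the routed pool.
    return set().union(*pass_sets)

def solved_by_exactly_one(pass_sets):
    # Tasks where complementarity is total: a single model passes.
    counts = Counter(t for s in pass_sets for t in s)
    return {t for t, n in counts.items() if n == 1}

rubric = Rubric(
    mandatory=["names_governing_structure", "flags_predicted_wrong_path"],
    tiers={"basic": 1, "solid": 2, "expert": 3},
)
# An otherwise expert-tier answer missing one mandatory criterion scores 0.
assert score_response({"names_governing_structure"}, "expert", rubric) == 0
assert score_response(
    {"names_governing_structure", "flags_predicted_wrong_path"}, "solid", rubric
) == 2

# Toy pass sets over task IDs for three models.
passes = [{1, 2, 3}, {3, 4}, {5}]
assert routing_coverage(passes) == {1, 2, 3, 4, 5}
assert solved_by_exactly_one(passes) == {1, 2, 4, 5}
```

Gating before tiered scoring is what produces the reported gap between conditional and unconditional scores: quality among passers can converge while overall pass rates stay far apart.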