
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

April 13, 2026
作者: Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho
cs.AI

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains, and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries; each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, HTTP 500 responses) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance (GPT-5.2 gains 27.5 points from minimal to maximum reasoning effort); and (4) strong agents are not necessarily strong environment simulators, and simulator quality is critical for LWM-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
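The core LWM mechanism described above (an LLM standing in for a domain-specific environment by generating tool responses) can be illustrated with a minimal sketch. All names here (`simulate_tool_response`, the `complete` callable, the prompt wording, the stub response) are hypothetical illustrations, not OccuBench's actual implementation:

```python
import json

def simulate_tool_response(tool_name, tool_args, domain_doc, complete):
    """LWM step: instead of calling a real API, ask an LLM to act as the
    environment and return the tool's output, grounded in domain documents.
    `complete` is any prompt -> text callable (hypothetical interface)."""
    prompt = (
        "You simulate a professional-domain environment.\n"
        f"Domain documentation:\n{domain_doc}\n\n"
        f"The agent called tool `{tool_name}` with arguments:\n"
        f"{json.dumps(tool_args)}\n\n"
        "Reply with the tool's JSON response only."
    )
    return json.loads(complete(prompt))

# Stub completer standing in for a real LLM call, so the sketch runs offline.
def fake_llm(prompt):
    return json.dumps({"status": "ok", "queue_position": 3})

resp = simulate_tool_response(
    "triage_lookup", {"patient_id": "P-17"},
    "Emergency department triage manual (excerpt).", fake_llm)
print(resp["status"])  # -> ok
```

Fault injection in this framing amounts to perturbing the simulator's output: an explicit fault would return a timeout or HTTP 500 message, while an implicit fault would silently truncate or drop fields from the JSON before it reaches the agent.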