OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models
April 13, 2026
Authors: Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho
cs.AI
Abstract
AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, HTTP 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 gaining 27.5 points from minimal to maximal reasoning effort; and (4) strong agents are not necessarily strong environment simulators, making simulator quality critical for the reliability of LWM-based evaluation. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
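The distinction between the two fault categories can be made concrete with a small sketch. This is illustrative only: the function names and the response format are hypothetical and do not reflect OccuBench's actual implementation. An explicit fault replaces the tool response with an overt error signal (a status code the agent can read), while an implicit fault silently degrades the payload, leaving a healthy-looking response with a missing field.

```python
import random

def inject_explicit_fault(response: dict) -> dict:
    """Explicit fault: an overt error signal (timeout, 500) the agent can see."""
    fault = random.choice([
        {"status": 500, "error": "Internal Server Error"},
        {"status": 504, "error": "Gateway Timeout"},
    ])
    # The body is wiped and the error surfaces in the status/error fields.
    return {**response, **fault, "body": None}

def inject_implicit_fault(response: dict) -> dict:
    """Implicit fault: silently drop a field; status stays 200, no error signal."""
    degraded = dict(response)
    body = dict(degraded.get("body") or {})
    if body:
        dropped_key = random.choice(list(body))
        del body[dropped_key]  # data degradation with no accompanying error
    degraded["body"] = body
    return degraded

# Hypothetical tool response from a simulated triage system.
ok = {"status": 200, "body": {"patient_id": "A17", "triage_level": 2}}
explicit = inject_explicit_fault(ok)   # status becomes 500/504, body is None
implicit = inject_implicit_fault(ok)   # status stays 200, one field vanishes
```

The asymmetry the abstract reports falls out of this shape: after `inject_explicit_fault` the agent only needs to read the status code, whereas after `inject_implicit_fault` it must notice on its own that an expected field is absent from an apparently successful response.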