OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can evaluate agents only in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance (GPT-5.2 improves by 27.5 points from minimal to maximum reasoning effort); and (4) strong agents are not necessarily strong environment simulators (simulator quality is critical for the reliability of LWM-based evaluation). OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
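To make the three fault modes concrete, the following is a minimal Python sketch of how a tool response might be perturbed under explicit, implicit, and mixed fault injection. It is an illustration under stated assumptions, not OccuBench's implementation: the names `inject_faults` and `lookup_patient`, the response fields, and the `rate` parameter are all hypothetical, and the real benchmark generates tool responses with an LLM-based simulator rather than a fixed function.

```python
import random
from typing import Any, Callable, Dict

ToolFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def lookup_patient(args: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for a simulated tool (illustrative fields only)."""
    return {"patient_id": args["patient_id"], "heart_rate": 118, "triage_level": 2}


def inject_faults(tool: ToolFn, mode: str, rate: float, rng: random.Random) -> ToolFn:
    """Wrap a tool so that a fraction `rate` of calls is faulty.

    mode="explicit": the call returns an overt error payload.
    mode="implicit": the call succeeds but one field is silently dropped.
    mode="mixed":    each faulty call is randomly explicit or implicit.
    """
    if mode not in {"explicit", "implicit", "mixed"}:
        raise ValueError(f"unknown fault mode: {mode}")

    def wrapped(args: Dict[str, Any]) -> Dict[str, Any]:
        # Most calls pass through untouched.
        if rng.random() >= rate:
            return tool(args)
        kind = mode if mode != "mixed" else rng.choice(["explicit", "implicit"])
        if kind == "explicit":
            # Overt signal: the agent can see that the call failed.
            return {"error": "HTTP 500: internal server error"}
        # No error signal: the agent must notice the missing field itself.
        response = dict(tool(args))
        if response:
            response.pop(rng.choice(sorted(response)))
        return response

    return wrapped


if __name__ == "__main__":
    rng = random.Random(0)
    faulty = inject_faults(lookup_patient, mode="implicit", rate=1.0, rng=rng)
    print(faulty({"patient_id": "p1"}))
```

In this sketch the implicit case returns a well-formed response with one field missing and no error key, which mirrors why the abstract reports implicit faults as harder: nothing in the response tells the agent that something went wrong.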

