OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Abstract

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can evaluate agents only in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains. It is enabled by Language World Models (LWMs), which simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains, and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500 errors) and mixed faults, because they carry no overt error signal and require the agent to detect the data degradation on its own; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 improving by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators, so simulator quality is critical to the reliability of LWM-based evaluation. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
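The three fault-injection modes named in the abstract can be illustrated with a minimal Python sketch. This is not OccuBench's implementation: the names `with_faults`, `inject_explicit`, `inject_implicit`, and `lookup_patient`, the `rate` and `seed` parameters, and the response shapes are all hypothetical, chosen only to show how an explicit fault differs from an implicit one from the agent's point of view.

```python
import copy
import random
from typing import Any, Callable, Dict

# Hypothetical tool signature: arguments in, JSON-like response out.
ToolFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def inject_explicit(rng: random.Random) -> Dict[str, Any]:
    """Return an overt error (timeout or 500) that the agent can see directly."""
    kind = rng.choice(["timeout", "http_500"])
    return {"error": kind, "message": f"simulated {kind}"}


def inject_implicit(response: Dict[str, Any], rng: random.Random) -> Dict[str, Any]:
    """Silently degrade the payload: truncate a list field or drop a field.

    No error key is added, so the agent must notice the degradation itself.
    """
    degraded = copy.deepcopy(response)
    keys = sorted(degraded)
    if not keys:
        return degraded
    key = rng.choice(keys)
    value = degraded[key]
    if isinstance(value, list) and len(value) > 1:
        degraded[key] = value[: len(value) // 2]  # truncated data
    else:
        del degraded[key]  # missing field
    return degraded


def with_faults(tool: ToolFn, mode: str, rate: float, seed: int = 0) -> ToolFn:
    """Wrap a tool so that a fraction `rate` of calls is faulted per `mode`."""
    if mode not in {"explicit", "implicit", "mixed"}:
        raise ValueError(f"unknown fault mode: {mode}")
    rng = random.Random(seed)

    def wrapped(args: Dict[str, Any]) -> Dict[str, Any]:
        # Simplification: the underlying tool is always called first,
        # even when its response is then replaced by an explicit error.
        response = tool(args)
        if rng.random() >= rate:
            return response  # clean call
        chosen = rng.choice(["explicit", "implicit"]) if mode == "mixed" else mode
        if chosen == "explicit":
            return inject_explicit(rng)
        return inject_implicit(response, rng)

    return wrapped


def lookup_patient(args: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical triage tool used only for this illustration."""
    return {
        "patient_id": args["patient_id"],
        "vitals": [98, 72, 120, 80],
        "triage_level": 2,
    }


if __name__ == "__main__":
    faulty_tool = with_faults(lookup_patient, mode="implicit", rate=1.0, seed=3)
    print(faulty_tool({"patient_id": "p1"}))
```

In this sketch an explicit fault always surfaces an `error` key, while an implicit fault returns a well-formed response that is merely incomplete, which is the property the abstract identifies as making implicit faults harder to handle.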
