OSWorld 2.0: A Benchmark for Computer-Use Agents on Long-Horizon Real-World Tasks

Abstract

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, which limits how well they expose the weaknesses of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows spanning everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of about 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenging phenomena that are common in real workflows yet underrepresented in prior benchmarks: interaction-design challenges such as streaming interaction and dynamic environments, and agent-pattern challenges such as cross-source reasoning, implicit-state inference, and visual-spatial precision. Tasks are grounded in authentic input artifacts and cross-referenced against realistic stateful user profile data, and include separate safety reports auditing safety-sensitive execution. Under our primary binary-completion metric at 500 steps, Claude Opus 4.8 with maximum thinking and batched tool calls scores best but still completes only 20.6% of tasks, with a partial score of 54.8%; GPT-5.5 is far more token-efficient yet plateaus near 13%. These results show that current agents are still far from professional-level computer use. Rather than stumbling on basic GUI control or coding, they lose track of constraints, miss information that arrives mid-task, guess rather than ask the user, and skip verification, and they struggle most when a task hinges on hidden state they must recover.