MyPCBench: A Benchmark for Personal, Intelligent Computer-Use Agents

Abstract

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment, because personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. The gap is widest on web tasks: live web evaluations cannot exercise sites that require a login or personal information, which are exactly the sites a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed-weight and open-weight models with a uniform computer+bash tool surface. The best model, Claude Opus 4.6, fully solves 55.4% of the tasks and is the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.