MyPCBench: A Benchmark for Personalized Intelligent Computer-Use Agents

Abstract

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment, because a personal assistant is expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. The gap is widest on web tasks: live web evaluations cannot exercise sites that require a login or personal information, which are exactly the sites a real personal assistant has to drive. We introduce MyPCBench, a benchmark that tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. The best model, Claude Opus 4.6, fully solves 55.4% of the tasks and is the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.