MyPCBench: A Benchmark for Personalized Computer-Use Agents

Abstract

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment, where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. The gap is widest on web tasks: live web evaluations cannot exercise sites that require a login or personal information, which are exactly the sites a real personal assistant has to operate. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models through a uniform computer+bash tool surface. The best model, Claude Opus 4.6, fully solves 55.4% of the tasks and is the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.