MyPCBench: A Benchmark for Personalized Intelligent Computer-Use Agents

Abstract

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment, because a deployed personal assistant is expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. The gap is widest on web tasks: live-web evaluations cannot exercise sites that require a login or personal information, which are exactly the sites a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. The best model, Claude Opus 4.6, fully solves 55.4% of the tasks and is the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.