iOSWorld: A Benchmark for Personal Smartphone Agents

Abstract

A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.