iOSWorld: A Benchmark for Personal Smartphone Agents

Abstract

A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain interconnected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks in three categories of increasing difficulty: single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches a 52% overall success rate but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from the added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.