iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Abstract

A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.