PhoneWorld: Scaling Phone-Use Agent Environments

Abstract

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.
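To make the "read-only app content plus mutable state" and "rule-based verifier" ideas concrete, the following is a minimal Python sketch. It is not the PhoneWorld implementation: the class `MockShopApp`, the action names, the catalog contents, and the helper `run_episode` are all hypothetical, chosen only to illustrate how a task and its verifier can be derived from the same environment state.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass
class MockShopApp:
    """Hypothetical mock app: a fixed catalog (read-only) plus mutable state."""
    catalog: Mapping[str, float]                         # read-only app content
    cart: dict[str, int] = field(default_factory=dict)   # mutable state
    screen: str = "home"                                 # mutable state

    def step(self, action: str, arg: str = "") -> None:
        # Only a small, assumed action set is supported; anything else is a no-op.
        if action == "open" and arg in {"home", "search", "cart"}:
            self.screen = arg
        elif action == "add_to_cart" and arg in self.catalog:
            self.cart[arg] = self.cart.get(arg, 0) + 1


def make_app() -> MockShopApp:
    # MappingProxyType gives a read-only view, so episodes cannot alter content.
    content = MappingProxyType({"usb cable": 5.0, "phone case": 12.0})
    return MockShopApp(catalog=content)


@dataclass(frozen=True)
class Task:
    instruction: str
    verifier: Callable[[MockShopApp], bool]  # rule over the final app state


task = Task(
    instruction="Add one phone case to the cart and open the cart screen.",
    verifier=lambda app: app.cart.get("phone case") == 1 and app.screen == "cart",
)


def run_episode(task: Task, actions: list[tuple[str, str]]) -> bool:
    """Replay actions in a fresh app, then apply the task's rule-based check."""
    app = make_app()
    for action, arg in actions:
        app.step(action, arg)
    return task.verifier(app)
```

Because the verifier inspects only the final environment state, the same check can score a scripted trajectory or an agent rollout, and resetting via `make_app()` keeps each episode reproducible.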