PhoneWorld: Scaling Phone-Use Agent Environments

Abstract

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.
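
To make the idea of a mock app "backed by read-only app content and mutable state" with a rule-based verifier more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not PhoneWorld's actual implementation: the names `MockShopApp`, `Task`, `make_env`, and `run_episode`, the action vocabulary, and the catalog entries are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Callable, Dict, List, Mapping, Tuple


@dataclass
class MockShopApp:
    """Hypothetical mock shopping app (not from the paper)."""
    catalog: Mapping[str, float]                        # read-only app content
    cart: Dict[str, int] = field(default_factory=dict)  # mutable state
    screen: str = "home"                                # current screen name

    def step(self, action: str, arg: str = "") -> None:
        """Apply one agent action; unknown actions are ignored."""
        if action == "open_search":
            self.screen = "search"
        elif action == "add_to_cart" and arg in self.catalog:
            # A state-changing interaction: the cart is updated.
            self.cart[arg] = self.cart.get(arg, 0) + 1
            self.screen = "cart"


@dataclass(frozen=True)
class Task:
    """An executable task paired with a rule-based verifier."""
    goal: str
    verify: Callable[[MockShopApp], bool]


def make_env() -> MockShopApp:
    # MappingProxyType gives a read-only view of the catalog,
    # so episodes cannot modify the app content.
    catalog = MappingProxyType({"usb_cable": 4.99, "phone_case": 12.50})
    return MockShopApp(catalog=catalog)


def run_episode(task: Task, actions: List[Tuple[str, str]]) -> bool:
    """Reset the environment, replay the actions, then check the final state."""
    env = make_env()
    for action, arg in actions:
        env.step(action, arg)
    return task.verify(env)


# Assumed example task: success is decided by a rule over the final state.
ADD_CABLE = Task(
    goal="Add one USB cable to the cart",
    verify=lambda app: app.cart == {"usb_cable": 1},
)
```

For example, `run_episode(ADD_CABLE, [("open_search", ""), ("add_to_cart", "usb_cable")])` satisfies the verifier, while an episode that adds a phone case instead does not. Because every episode starts from a fresh `make_env()` call and success depends only on the final state, the same environment could in principle serve both for evaluation and for generating training rollouts, which is the property the abstract emphasizes.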