PhoneWorld: Scaling Phone-Use Agent Environments

Abstract

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.
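To make the idea of a mock app "backed by read-only app content and mutable state" with a rule-based verifier concrete, the following is a minimal sketch, not the PhoneWorld implementation. Every name in it (`MockShopApp`, `Task`, `make_env`, `run_episode`, the catalog items) is hypothetical and chosen only for illustration; the abstract does not specify the actual interfaces.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass
class MockShopApp:
    """Hypothetical mock app: read-only catalog plus mutable cart state."""
    catalog: Mapping[str, float]               # read-only app content
    cart: dict = field(default_factory=dict)   # mutable environment state

    def add_to_cart(self, item: str, qty: int = 1) -> None:
        # Only interactions like this one are allowed to change state.
        if item not in self.catalog:
            raise KeyError(item)
        self.cart[item] = self.cart.get(item, 0) + qty


@dataclass
class Task:
    """A user goal paired with a rule-based check on the final state."""
    goal: str
    verifier: Callable[[MockShopApp], bool]


def make_env() -> MockShopApp:
    # MappingProxyType gives a read-only view, so content cannot be mutated.
    content = MappingProxyType({"usb_cable": 5.0, "headphones": 20.0})
    return MockShopApp(catalog=content)


def run_episode(actions, task: Task) -> bool:
    """Replay (method_name, kwargs) actions on a fresh env, then verify."""
    app = make_env()
    for name, kwargs in actions:
        getattr(app, name)(**kwargs)
    return task.verifier(app)


task = Task(
    goal="Add two USB cables to the cart",
    verifier=lambda app: app.cart.get("usb_cable") == 2,
)
```

Because each episode starts from a fresh environment and the verifier inspects only the final state, the same task can be replayed reproducibly for evaluation or used to label training rollouts as successes or failures.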