Training Open Models for Agentic Phone Use

June 22, 2026
Authors: Zhengyang Tang, Xin Lai, Pengyuan Lyu, Xinyuan Wang, Tianyi Bai, Chenxin Li, Yiduo Guo, Huawen Shen, Yuxuan Liu, Junyi Li, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Ji-Rong Wen, Rui Yan, Chengquan Zhang, Han Hu
cs.AI

Abstract

Phones are becoming an important execution surface for general-purpose agents, but training open models for reliable phone use remains difficult. The environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. We present PhoneBuddy, a training recipe and open-model line for agentic phone use that combines a real-app environment with a mock-app environment, PhoneWorld, which reconstructs runnable mock apps from real GUI usage structure. PhoneBuddy first builds a shared supervised fine-tuning stage from trajectories collected in both environments, then compares real-app RL against mixed RL across both environments. Across a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, task success rate improves from 36.67% after supervised fine-tuning to 40.67% after real-app RL and 45.33% after mixed RL. On AndroidWorld, scores rise along the same progression, from 60.3% to 77.2% to 83.2%. These results show that mock-app training is not a replacement for real-app RL, but a complementary source of scalable, resettable, and automatically checked interaction. The gains are strongest on app and mini-app tasks, while long-horizon cross-app workflows remain an important open challenge.
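
The reported rates can be read back as task counts and percentage-point gains. The following is a minimal Python sketch, assuming each human-evaluation rate is simply successful tasks divided by the 150 tasks; the abstract does not state the counts itself, and the function and variable names are illustrative rather than taken from the paper.

```python
# Success rates (percent) as reported in the abstract.
TOTAL_TASKS = 150
human_eval = {"SFT": 36.67, "real-app RL": 40.67, "mixed RL": 45.33}
android_world = {"SFT": 60.3, "real-app RL": 77.2, "mixed RL": 83.2}


def implied_successes(rate_percent, total=TOTAL_TASKS):
    """Nearest whole number of successes implied by a rate.

    Assumption (not stated in the abstract): rate = successes / total.
    """
    return round(rate_percent / 100 * total)


def gains_over_sft(rates):
    """Percentage-point gain of each later stage over the SFT baseline."""
    base = rates["SFT"]
    return {stage: round(rate - base, 2)
            for stage, rate in rates.items() if stage != "SFT"}


counts = {stage: implied_successes(rate) for stage, rate in human_eval.items()}
print(counts)                         # {'SFT': 55, 'real-app RL': 61, 'mixed RL': 68}
print(gains_over_sft(human_eval))     # gains on the 150-task human evaluation
print(gains_over_sft(android_world))  # gains on AndroidWorld
```

Under that assumption, the human-evaluation rates correspond to 55, 61, and 68 successful tasks out of 150, so mixed RL adds 7 tasks over real-app RL and 13 over supervised fine-tuning.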