Training Open Models for Agentic Phone Use

Abstract

Phones are becoming an important execution surface for general-purpose agents, but training open models for reliable phone use remains difficult. The environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. We present PhoneBuddy, a training recipe and open-model line for agentic phone use that combines a real-app environment with a mock-app environment, PhoneWorld, which reconstructs runnable mock apps from real GUI usage structure. PhoneBuddy first builds a shared supervised fine-tuning (SFT) stage from trajectories collected in both environments, then compares RL on real apps alone against mixed RL across both environments. In a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, task success rate improves from 36.67% after SFT to 40.67% after real-app RL and 45.33% after mixed RL. On AndroidWorld, the score rises along the same progression from 60.3% to 77.2% to 83.2%. These results show that mock-app training is not a replacement for real-app RL, but a complementary source of scalable, resettable, and automatically checked interaction. The gains are strongest on app and mini-app tasks, while long-horizon cross-app workflows remain an important open challenge.