Training Open Models for Agentic Phone Use

Abstract

Phones are becoming an important execution surface for general-purpose agents, but training open models for reliable phone use remains difficult. The environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. We present PhoneBuddy, a training recipe and open-model line for agentic phone use that combines a real-app environment with a mock-app environment, PhoneWorld, which reconstructs runnable mock apps from real GUI usage structure. PhoneBuddy first builds a shared supervised fine-tuning stage from trajectories collected in both environments, then compares real-app reinforcement learning (RL) against mixed RL across both environments. In a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, task success rate improves from 36.67% after supervised fine-tuning to 40.67% after real-app RL and 45.33% after mixed RL. On AndroidWorld, success rate follows the same progression, rising from 60.3% to 77.2% to 83.2%. These results show that mock-app training is not a replacement for real-app RL, but a complementary source of scalable, resettable, and automatically checked interaction. The gains are strongest on app and mini-app tasks, while long-horizon cross-app workflows remain an important open challenge.
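To make the reported numbers easier to compare, the short Python sketch below converts the real-phone success rates into task counts and computes gains in percentage points over the supervised fine-tuning stage. The rates and the 150-task total come from the abstract. The whole-number counts assume each task is scored once as pass or fail, which the abstract does not state, and the function and variable names are illustrative only.

```python
# Success rates (percent) as reported in the abstract.
N_TASKS = 150  # size of the real-phone human evaluation
real_phone = {"SFT": 36.67, "real-app RL": 40.67, "mixed RL": 45.33}
android_world = {"SFT": 60.3, "real-app RL": 77.2, "mixed RL": 83.2}


def implied_successes(rate_pct, n_tasks=N_TASKS):
    """Nearest whole number of successful tasks implied by a percentage.

    Assumption: one pass/fail judgment per task (not stated in the abstract).
    """
    return round(rate_pct / 100 * n_tasks)


def gains_over_first_stage(rates):
    """Gain in percentage points of each later stage over the first stage."""
    stages = list(rates)
    base = rates[stages[0]]
    return {stage: round(rates[stage] - base, 2) for stage in stages[1:]}


counts = {stage: implied_successes(rate) for stage, rate in real_phone.items()}
real_gains = gains_over_first_stage(real_phone)
aw_gains = gains_over_first_stage(android_world)

for stage, n in counts.items():
    print(f"{stage}: about {n} of {N_TASKS} tasks")
print("Real-phone gains over SFT (points):", real_gains)
print("AndroidWorld gains over SFT (points):", aw_gains)
```

Under that assumption, the three real-phone rates correspond to 55, 61, and 68 successful tasks out of 150, so mixed RL adds roughly as many successes over real-app RL as real-app RL adds over supervised fine-tuning.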