PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions
June 12, 2026
Authors: Chenxin Li, Zhengyao Fang, Zhengyang Tang, Pengyuan Lyu, Xingran Zhou, Xin Lai, Fei Tang, Liang Wu, Yiduo Guo, Weinong Wang, Junyi Li, Yi Zhang, Yang Ding, Huawen Shen, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Chengquan Zhang, Han Hu
cs.AI
Abstract
Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.
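The abstract does not describe how routing, GUI delegation, or tracing are implemented. As a rough illustration only, the idea of deterministic action routing with bounded GUI delegation and an auditable trace might be sketched as below. Every name here (`Action`, `Harness`, `max_gui_steps`, the handler functions, and the example commands) is a hypothetical stand-in and not part of PhoneHarness.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical surface names, mirroring the GUI / CLI / tool split in the abstract.
SURFACES = ("gui", "cli", "tool")


@dataclass
class Action:
    surface: str   # expected to be one of SURFACES
    command: str   # illustrative payload, e.g. a tap target or a shell command


@dataclass
class Harness:
    handlers: Dict[str, Callable[[str], str]]   # one executor per surface
    max_gui_steps: int = 3                      # assumed budget for GUI delegation
    trace: List[dict] = field(default_factory=list)
    gui_steps: int = 0

    def run(self, action: Action) -> str:
        """Route one action by its surface and record the outcome in the trace."""
        if action.surface not in self.handlers:
            status, result = "rejected", "unknown surface"
        elif action.surface == "gui" and self.gui_steps >= self.max_gui_steps:
            status, result = "rejected", "gui budget exhausted"
        else:
            if action.surface == "gui":
                self.gui_steps += 1
            status, result = "ok", self.handlers[action.surface](action.command)
        # Every attempt, including rejected ones, is logged so a run can be audited.
        self.trace.append({
            "step": len(self.trace),
            "surface": action.surface,
            "command": action.command,
            "status": status,
            "result": result,
        })
        return status


# Stub executors that only echo their input; real ones would drive a device.
stub_handlers = {s: (lambda cmd, s=s: f"{s}:{cmd}") for s in SURFACES}
harness = Harness(handlers=stub_handlers, max_gui_steps=1)
statuses = [
    harness.run(Action("cli", "list-files")),
    harness.run(Action("gui", "tap Send")),
    harness.run(Action("gui", "tap Confirm")),   # exceeds the GUI budget of 1
]
```

In this sketch the same action sequence always takes the same route, the GUI budget caps how much work is delegated to screen control, and the trace keeps one record per attempted action, which is the kind of evidence a side-effect verifier could inspect.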