PhoneHarness: Harnessing Phone-Use Agents with Mixed GUI, CLI, and Tool Actions

Abstract

Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.
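The abstract names three mechanisms: deterministic action routing, bounded GUI delegation, and auditable execution traces. The following is a minimal sketch of how they could fit together, not the PhoneHarness implementation. Every name in it (`Action`, `Trace`, `route`, `run_gui_bounded`, the surface labels, and the step budget) is a hypothetical placeholder, and the handlers are stubs rather than real device calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    surface: str  # hypothetical labels: "gui", "cli", or "tool"
    payload: str


@dataclass
class Trace:
    """Append-only log so every routed action can be audited afterwards."""
    events: List[dict] = field(default_factory=list)

    def record(self, surface: str, payload: str, outcome: str) -> None:
        self.events.append({
            "step": len(self.events),
            "surface": surface,
            "payload": payload,
            "outcome": outcome,
        })


def run_gui_bounded(payload: str,
                    step_fn: Callable[[str, int], bool],
                    max_steps: int) -> str:
    """Delegate to a GUI sub-agent, but give up once the step budget is spent."""
    for i in range(max_steps):
        if step_fn(payload, i):  # step_fn returns True when the GUI goal is reached
            return f"done after {i + 1} steps"
    return "budget exhausted"


def route(action: Action,
          handlers: Dict[str, Callable[[str], str]],
          trace: Trace,
          gui_step: Callable[[str, int], bool],
          max_gui_steps: int = 3) -> str:
    """Deterministic routing: the surface label alone selects the executor."""
    if action.surface == "gui":
        outcome = run_gui_bounded(action.payload, gui_step, max_gui_steps)
    elif action.surface in handlers:
        outcome = handlers[action.surface](action.payload)
    else:
        outcome = "rejected: unknown surface"
    trace.record(action.surface, action.payload, outcome)
    return outcome


# Stub executors standing in for real CLI and tool backends (assumptions).
handlers = {
    "cli": lambda p: f"cli ok: {p}",
    "tool": lambda p: f"tool ok: {p}",
}
trace = Trace()
gui_step = lambda payload, i: i == 1  # stub GUI agent that succeeds on its second step

route(Action("cli", "list-files"), handlers, trace, gui_step)
route(Action("gui", "open settings"), handlers, trace, gui_step)
route(Action("sms", "hi"), handlers, trace, gui_step)

for event in trace.events:
    print(event)
```

In this sketch an unknown surface is rejected but still recorded, so the trace shows what the agent attempted as well as what succeeded. A benchmark in the spirit of PhoneHarness Bench would additionally verify the side effect itself rather than rely on the logged outcome string.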