PhoneHarness: Harnessing Phone-Use Agents with Mixed GUI, CLI, and Tool Actions

Abstract

Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers: they observe a screen, emit taps and swipes, and are scored by the state of the target app. Real phone-use tasks are broader. They require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness setting by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only on visual GUI control.
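
As a rough illustration of what deterministic action routing, bounded GUI delegation, and an auditable trace could look like together, the following Python sketch routes each action to a handler by its declared surface, caps the number of GUI steps, and records every decision. It is not the PhoneHarness implementation: the names `Action`, `MiniHarness`, `gui_budget`, and `make_stub_harness`, as well as the stub handlers, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    surface: str  # one of "gui", "cli", "tool" (assumed labels)
    payload: str  # free-form instruction for the chosen surface


@dataclass
class MiniHarness:
    # Hypothetical mapping from surface name to an executor function.
    handlers: Dict[str, Callable[[str], str]]
    # Hypothetical cap on how many GUI steps may be delegated.
    gui_budget: int = 3
    # Append-only record of every routed action and its outcome.
    trace: List[dict] = field(default_factory=list)

    def run(self, action: Action) -> str:
        # Routing is a plain lookup on the declared surface, so the same
        # action always reaches the same handler.
        if action.surface not in self.handlers:
            result = "rejected: unknown surface"
        elif action.surface == "gui" and self.gui_budget <= 0:
            result = "rejected: gui budget exhausted"
        else:
            if action.surface == "gui":
                self.gui_budget -= 1
            result = self.handlers[action.surface](action.payload)
        # Rejected actions are logged too, so the trace stays complete.
        self.trace.append(
            {
                "step": len(self.trace),
                "surface": action.surface,
                "payload": action.payload,
                "result": result,
            }
        )
        return result


def make_stub_harness(gui_budget: int = 3) -> MiniHarness:
    # Stub handlers that only echo their input; a real system would
    # drive a device, a shell, or a host-side tool here.
    handlers = {
        "gui": lambda p: f"gui-ok:{p}",
        "cli": lambda p: f"cli-ok:{p}",
        "tool": lambda p: f"tool-ok:{p}",
    }
    return MiniHarness(handlers=handlers, gui_budget=gui_budget)
```

In this sketch the budget check runs before the handler is called, so a GUI step beyond the cap is refused without touching the device, while the refusal itself still appears in the trace for later inspection.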