PhoneHarness: Harnessing Phone-Use Agents with Mixed GUI, CLI, and Tool Actions

Abstract

Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored on the target app state. Real phone-use tasks are broader: the agent must decide when to use app GUIs, device-side commands, or structured tools, and must leave evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only on visual GUI control.
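The abstract names three mechanisms (deterministic action routing, bounded GUI delegation, and auditable execution traces) without describing how they are implemented. The following is a minimal illustrative sketch of how such a loop might fit together, not the PhoneHarness implementation: the `Action` and `Harness` classes, the `gui_budget` parameter, the rejection messages, and the stub handlers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    surface: str  # assumed surface labels: "gui", "cli", or "tool"
    payload: str  # free-form description of the action to perform


@dataclass
class Harness:
    # Hypothetical mapping from a surface name to the function that executes it.
    handlers: Dict[str, Callable[[str], str]]
    gui_budget: int = 3  # assumed cap on the number of delegated GUI steps
    gui_used: int = 0
    trace: List[dict] = field(default_factory=list)

    def run(self, action: Action) -> str:
        # Deterministic routing: the surface label alone selects the handler.
        if action.surface not in self.handlers:
            result = "rejected: unknown surface"
        elif action.surface == "gui" and self.gui_used >= self.gui_budget:
            # Bounded delegation: refuse GUI steps once the budget is spent.
            result = "rejected: gui budget exhausted"
        else:
            if action.surface == "gui":
                self.gui_used += 1
            result = self.handlers[action.surface](action.payload)
        # Audit trail: every attempt is recorded, including rejected ones.
        self.trace.append({
            "step": len(self.trace),
            "surface": action.surface,
            "payload": action.payload,
            "result": result,
        })
        return result


# Stub handlers stand in for real GUI, CLI, and tool executors.
harness = Harness(
    handlers={
        "gui": lambda p: f"gui:{p}",
        "cli": lambda p: f"cli:{p}",
        "tool": lambda p: f"tool:{p}",
    },
    gui_budget=1,
)

results = [
    harness.run(a)
    for a in [
        Action("cli", "list installed packages"),
        Action("gui", "tap Send"),
        Action("gui", "tap Confirm"),  # exceeds the GUI budget of 1
        Action("sms", "hello"),        # no handler registered for "sms"
    ]
]

for entry in harness.trace:
    print(entry)
```

In this sketch, rejected actions are logged alongside executed ones, so the trace alone shows which surface handled each step and where the GUI budget stopped further delegation. A verifier of the kind the abstract describes could then check such a trace against the observed device state; that checking step is omitted here.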