KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Abstract

Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench evaluates the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, through a hybrid protocol that combines rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions that require user preference inference or intervention calibration, and this holds even for frontier models such as Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
