KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
April 9, 2026
作者: Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, Wenqi Zhang, Xu Tan, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI
Abstract
Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts; neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench evaluates the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, through a hybrid protocol that combines rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution drop below a 50% success rate under vague instructions requiring user preference inference or intervention calibration, even for frontier models such as Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
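To illustrate the hybrid protocol described above, the following minimal Python sketch shows one way rule-based verification and LLM-as-a-Judge scoring could be combined per episode. The paper's abstract does not specify an implementation, so every name here (EpisodeTrace, rule_verify, llm_judge, call_llm, hybrid_score) is an illustrative assumption rather than the benchmark's actual API.

    # Hypothetical sketch: deterministic rules gate hard constraints (e.g., the
    # final GUI state), while an LLM judge scores softer behaviors such as
    # consent negotiation and post-rejection restraint.
    from dataclasses import dataclass

    @dataclass
    class EpisodeTrace:
        final_ui_state: dict      # key facts extracted from the emulator state
        dialogue: list[str]       # agent / user-simulator turns
        actions: list[dict]       # grounded GUI actions taken by the agent

    def rule_verify(trace: EpisodeTrace, expected_state: dict) -> bool:
        """Deterministic check: every expected key/value must hold in the final state."""
        return all(trace.final_ui_state.get(k) == v for k, v in expected_state.items())

    def llm_judge(trace: EpisodeTrace, rubric: str, call_llm) -> float:
        """Ask an LLM judge to grade the dialogue against a rubric, returning 0-1."""
        prompt = (
            f"Rubric:\n{rubric}\n\nDialogue:\n" + "\n".join(trace.dialogue)
            + "\n\nReturn a single score between 0 and 1."
        )
        return float(call_llm(prompt))  # call_llm: any chat-completion wrapper

    def hybrid_score(trace: EpisodeTrace, expected_state: dict, rubric: str, call_llm) -> float:
        # Rule verification acts as a hard gate; the judge refines the score.
        if not rule_verify(trace, expected_state):
            return 0.0
        return llm_judge(trace, rubric, call_llm)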