KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Abstract

Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, evaluated through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
