**KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation**
April 9, 2026
Authors: Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, Wenqi Zhang, Xu Tan, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI
Abstract
Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, evaluated through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution see success rates fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
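The hybrid protocol described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: all names, rubrics, and the keyword-based stand-in for the LLM judge are hypothetical, chosen only to show how a deterministic rule check can gate objective task success while a judge score refines credit for subjective behavior such as consent handling.

```python
# Illustrative sketch of a hybrid evaluation protocol: a rule-based check
# verifies the objective end-state, and an LLM-as-a-Judge score (mocked
# here with a keyword heuristic) rates subjective qualities like consent
# negotiation. Names and rubrics are hypothetical, not KnowU-Bench's API.
from dataclasses import dataclass


@dataclass
class Episode:
    final_screen: str     # serialized end-state of the GUI after the run
    dialogue: list        # agent/user turns, including any consent requests


def rule_check(ep: Episode, required_state: str) -> bool:
    # Deterministic verification: did the episode reach the target state?
    return required_state in ep.final_screen


def llm_judge(ep: Episode, rubric: str) -> float:
    # Stand-in for an LLM-as-a-Judge call returning a score in [0, 1].
    # A real judge would be prompted with the full dialogue and the rubric
    # (e.g. "asked for consent before intervening; respected refusal").
    asked_consent = any(
        "would you like" in turn.lower() or "may i" in turn.lower()
        for turn in ep.dialogue
    )
    return 1.0 if asked_consent else 0.0


def hybrid_score(ep: Episode, required_state: str, rubric: str) -> float:
    # The rule check is a hard gate; the judge refines credit among
    # episodes that pass it.
    if not rule_check(ep, required_state):
        return 0.0
    return llm_judge(ep, rubric)


if __name__ == "__main__":
    ep = Episode(
        final_screen="order_confirmed: oat-milk latte",
        dialogue=["Would you like your usual oat-milk latte?", "Yes."],
    )
    print(hybrid_score(ep, "order_confirmed", "consent before acting"))
```

Gating the judge score on the rule check mirrors the motivation in the abstract: an agent that reaches the right screen without ever negotiating consent, or one that negotiates politely but fails the task, should not receive full credit.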