Do Phone-Use Agents Respect Your Privacy?
April 1, 2026
Authors: Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang
cs.AI
Abstract
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at https://github.com/tangzhy/MyPhoneBench.
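To make the data-minimization check concrete, the following is a minimal Python sketch of what a rule-based audit over logged form writes could look like. It is not taken from MyPhoneBench: the `FormWrite` record, the function names, the field names, and the sample values are all hypothetical and chosen only for illustration. It assumes an instrumented app can log every value the agent types, and that each task declares which entries it actually requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormWrite:
    """One value the agent typed into a form entry (hypothetical log record)."""
    app: str
    field: str
    value: str


def audit_minimization(writes, required_fields, personal_fields):
    """Return writes that put personal data into entries the task did not require."""
    return [
        w for w in writes
        if w.field in personal_fields and w.field not in required_fields
    ]


def privacy_compliant_success(task_succeeded, violations):
    """Count a task as a joint success only if it succeeded with no violations."""
    return task_succeeded and not violations


# Hypothetical trajectory: the agent books an appointment but also fills
# an optional insurance entry that the task never asked for.
writes = [
    FormWrite("clinic", "name", "Alex Doe"),
    FormWrite("clinic", "appointment_date", "2026-04-03"),
    FormWrite("clinic", "insurance_id", "INS-0000"),
]
required = {"name", "appointment_date"}
personal = {"name", "insurance_id", "home_address"}

violations = audit_minimization(writes, required, personal)
```

Under these assumptions, a task can succeed functionally while still failing the joint criterion, which is one way the ordering of models could differ between a success-only metric and a combined success-and-privacy metric.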