ChatPaper.aiChatPaper

手机智能助手真的尊重用户隐私吗?

Do Phone-Use Agents Respect Your Privacy?

April 1, 2026
作者: Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang
cs.AI

摘要

本研究旨在探究手机智能体在执行良性移动任务时是否尊重用户隐私。由于隐私合规行为尚未在手机智能体中实现可操作化定义,且普通应用程序不会披露智能体在执行过程中具体将哪些数据填入何种表单条目,该问题一直难以解答。为使该问题可量化,我们推出了MyPhoneBench——一个可验证的手机智能体隐私行为评估框架。我们通过最小化隐私合约iMy,将尊重隐私的手机使用操作化定义为权限许可访问、最小化披露和用户可控存储,并将其与经过工具化处理的模拟应用程序及基于规则的审计系统相结合,使不必要的权限请求、欺骗性重复披露和冗余表单填写行为变得可观测、可复现。通过对10款移动应用执行300项任务测试五大前沿模型,我们发现任务成功率、隐私合规任务完成度以及后续会话中对已保存偏好的使用能力是三种独立的能力维度,没有单一模型能在这三方面均占优势。联合评估任务成功率和隐私表现会重塑仅凭单一指标的模型排序。各模型最普遍存在的缺陷是基础的数据最小化原则:智能体仍会填写任务非必需的 optional 个人条目。这些结果表明,隐私漏洞源于对良性任务的过度"热心"执行,而仅以成功率作为评估标准会高估当前手机智能体的实际部署成熟度。所有代码、模拟应用及智能体运行轨迹已公开于~https://github.com/tangzhy/MyPhoneBench。
English
We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at~ https://github.com/tangzhy/MyPhoneBench.
PDF11April 3, 2026