Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents

Abstract

Smartphones bring significant convenience to users, but they also allow devices to record a wide range of personal information. Existing smartphone agents powered by Multimodal Large Language Models (MLLMs) have achieved remarkable performance in automating a variety of tasks. This comes at a cost, however: while operating, these agents are granted broad access to users' sensitive personal information. To gain a thorough understanding of the privacy awareness of these agents, we present what is, to the best of our knowledge, the first large-scale benchmark for this purpose, encompassing 7,138 scenarios. For the privacy context in each scenario, we additionally annotate its type (e.g., Account Credentials), sensitivity level, and location. We then carefully benchmark seven available mainstream smartphone agents. Our results show that almost all benchmarked agents exhibit unsatisfactory privacy awareness (RA), with performance remaining below 60% even when explicit hints are given. Overall, closed-source agents show better privacy ability than open-source ones, and Gemini 2.0-flash performs best, with an RA of 67%. We also find that the agents' privacy detection capability is strongly related to the scenario sensitivity level: scenarios with a higher sensitivity level are typically easier to identify. We hope these findings prompt the research community to rethink the unbalanced utility-privacy tradeoff in smartphone agents. Our code and benchmark are available at https://zhixin-l.github.io/SAPA-Bench.
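As a rough illustration of the annotation scheme and the sensitivity finding described in the abstract, the sketch below groups annotated privacy items by sensitivity level and computes the fraction an agent flagged at each level. Everything in it is an assumption made for illustration: the `PrivacyAnnotation` field names, the integer sensitivity scale, the free-text `location`, the `detection_rate_by_sensitivity` helper, and the toy data. None of it is the SAPA-Bench schema or its RA metric.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivacyAnnotation:
    """One annotated privacy item in a scenario (hypothetical schema)."""
    privacy_type: str  # e.g. "Account Credentials"
    sensitivity: int   # assumed ordinal scale; higher means more sensitive
    location: str      # assumed free-text description of where the item appears


def detection_rate_by_sensitivity(annotations, detected):
    """Return {sensitivity level: fraction of items the agent flagged}."""
    if len(annotations) != len(detected):
        raise ValueError("annotations and detected must have the same length")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ann, was_detected in zip(annotations, detected):
        totals[ann.sensitivity] += 1
        hits[ann.sensitivity] += int(was_detected)
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Toy data only; these values are not taken from the benchmark.
toy_annotations = [
    PrivacyAnnotation("Account Credentials", 3, "login screen"),
    PrivacyAnnotation("Account Credentials", 3, "password field"),
    PrivacyAnnotation("Placeholder Type", 1, "chat window"),
    PrivacyAnnotation("Placeholder Type", 1, "profile page"),
]
toy_detected = [True, True, True, False]
rates = detection_rate_by_sensitivity(toy_annotations, toy_detected)
```

With this toy input, the higher sensitivity level ends up with the higher detection rate, which mirrors the direction of the trend the abstract reports without reproducing any of its numbers.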
