ChatPaper.aiChatPaper

OS-Harm:计算机使用代理安全性评估基准

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

June 17, 2025
作者: Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, Maksym Andriushchenko
cs.AI

摘要

计算机使用代理是基于大语言模型(LLM)的智能体,能够通过处理屏幕截图或无障碍树直接与图形用户界面交互。尽管这类系统日益普及,但其安全性却大多被忽视,而评估和理解其潜在有害行为对于广泛采用至关重要。为填补这一空白,我们推出了OS-Harm,一个用于衡量计算机使用代理安全性的新基准。OS-Harm构建于OSWorld环境之上,旨在测试模型在三大类危害中的表现:用户故意滥用、提示注入攻击及模型不当行为。为覆盖这些情况,我们设计了150项任务,涵盖多种安全违规行为(骚扰、版权侵犯、虚假信息、数据泄露等),并要求代理与多种操作系统应用(电子邮件客户端、代码编辑器、浏览器等)进行交互。此外,我们提出了一种自动化评判机制,用于评估代理的准确性和安全性,该机制与人工标注达成了高度一致(F1分数分别为0.76和0.79)。我们基于一系列前沿模型(如o4-mini、Claude 3.7 Sonnet、Gemini 2.5 Pro)对计算机使用代理进行了评估,并深入剖析了它们的安全性。特别地,所有模型在面对许多故意滥用查询时往往直接遵从,相对容易受到静态提示注入攻击,并偶尔执行不安全操作。OS-Harm基准测试现已发布于https://github.com/tml-epfl/os-harm。
English
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
PDF42June 19, 2025