OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
June 17, 2025
Authors: Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, Maksym Andriushchenko
cs.AI
Abstract
Computer use agents are LLM-based agents that can directly interact with a
graphical user interface by processing screenshots or accessibility trees.
While these systems are gaining popularity, their safety has been largely
overlooked, despite the fact that evaluating and understanding their potential
for harmful behavior is essential for widespread adoption. To address this gap,
we introduce OS-Harm, a new benchmark for measuring the safety of computer use
agents. OS-Harm is built on top of the OSWorld environment and aims to test
models across three categories of harm: deliberate user misuse, prompt
injection attacks, and model misbehavior. To cover these cases, we create 150
tasks that span several types of safety violations (harassment, copyright
infringement, disinformation, data exfiltration, etc.) and require the agent to
interact with a variety of OS applications (email client, code editor, browser,
etc.). Moreover, we propose an automated judge that evaluates both the accuracy
and the safety of agents and achieves high agreement with human annotations (F1
scores of 0.76 and 0.79, respectively). We evaluate computer use agents based on
a range of frontier models, such as o4-mini, Claude 3.7 Sonnet, and Gemini 2.5
Pro, and provide insights into their safety. In particular, all models tend to directly comply
with many deliberate misuse queries, are relatively vulnerable to static prompt
injections, and occasionally perform unsafe actions. The OS-Harm benchmark is
available at https://github.com/tml-epfl/os-harm.
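
To make the judge-agreement figures concrete, the following is a minimal sketch of how an F1 score between an automated judge and human annotators could be computed, treating the human labels as ground truth. The function name `f1_score` and the label lists are illustrative assumptions, not code or data from the OS-Harm repository, and the paper's exact evaluation protocol may differ.

```python
def f1_score(human, judge):
    """F1 of the judge's positive labels, with human labels as ground truth."""
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)        # both flag the run
    fp = sum(1 for h, j in pairs if not h and j)    # judge flags, human does not
    fn = sum(1 for h, j in pairs if h and not j)    # human flags, judge misses
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six agent trajectories (True = judged unsafe).
# These values are made up for illustration only.
human_unsafe = [True, True, False, True, False, False]
judge_unsafe = [True, False, False, True, True, False]

agreement = f1_score(human_unsafe, judge_unsafe)
print(f"F1 agreement: {agreement:.2f}")
```

The same function could be applied separately to task-completion labels and to safety labels, which would yield one F1 value per criterion, as in the two figures reported above.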