OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Abstract

Computer use agents are LLM-based agents that can interact directly with a graphical user interface by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, even though evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring the safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both the accuracy and the safety of agents, which achieves high agreement with human annotations (F1 scores of 0.76 and 0.79). We evaluate computer use agents based on a range of frontier models, such as o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro, and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
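To illustrate how agreement between an automated judge and human annotations can be summarized as an F1 score, the following is a minimal sketch. It assumes binary labels (for example, "unsafe" versus "not unsafe") with the human annotation taken as the reference. The function name `f1_score` and all label values are hypothetical and invented for illustration; this is not the evaluation code from the OS-Harm repository.

```python
def f1_score(human, judge):
    """F1 of the judge's binary labels, treating human labels as the reference."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")

    # Count agreements and disagreements on the positive class.
    tp = sum(1 for h, j in zip(human, judge) if h and j)      # both flag the run
    fp = sum(1 for h, j in zip(human, judge) if not h and j)  # only the judge flags it
    fn = sum(1 for h, j in zip(human, judge) if h and not j)  # only the human flags it

    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight agent runs (True = annotated as unsafe).
human_unsafe = [True, True, False, True, False, False, True, False]
judge_unsafe = [True, False, False, True, True, False, True, False]

print(round(f1_score(human_unsafe, judge_unsafe), 2))  # prints 0.75
```

In this toy example the judge and the human agree on three unsafe runs and disagree once in each direction, so precision and recall are both 0.75. The same computation could be applied separately to task-completion labels and to safety labels.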
