OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Abstract

Computer use agents are LLM-based agents that can directly interact with a graphical user interface by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, even though evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring the safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge that evaluates both the accuracy and the safety of agents and achieves high agreement with human annotations (F1 scores of 0.76 and 0.79). We evaluate computer use agents based on a range of frontier models, such as o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro, and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
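The agreement figures quoted for the automated judge are F1 scores against human annotations. As a minimal sketch of how such a score could be computed, assuming binary labels where `True` marks the positive class (for example, "unsafe") and human labels are treated as ground truth: the function name `f1_score` and the label lists below are hypothetical and purely illustrative, not data or code from OS-Harm.

```python
def f1_score(human, judge):
    """F1 of judge labels measured against human labels (True = positive class)."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    # Count true positives, false positives, and false negatives.
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    if tp == 0:
        # No true positives: precision/recall are zero or undefined; report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six task executions (illustrative only).
human = [True, True, False, True, False, False]
judge = [True, False, False, True, True, False]
print(f1_score(human, judge))
```

With these made-up labels there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3.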
