SafeArena: Evaluating the Safety of Autonomous Web Agents

Abstract

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. To assess realistic misuse of web agents, we classify the harmful tasks into five harm categories: misinformation, illegal activity, harassment, cybercrime, and social bias. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework, which categorizes agent behavior across four risk levels. We find that agents are surprisingly compliant with malicious requests: GPT-4o and Qwen-2 complete 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available at https://safearena.github.io