SafeArena: Evaluating the Safety of Autonomous Web Agents

Abstract

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. The harmful tasks fall into five harm categories (misinformation, illegal activity, harassment, cybercrime, and social bias) and are designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework, which categorizes agent behavior across four risk levels. We find that agents are surprisingly compliant with malicious requests: GPT-4o and Qwen-2-VL complete 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available at https://safearena.github.io