SafeArena: Evaluating the Safety of Autonomous Web Agents

March 6, 2025
作者: Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stańczak, Siva Reddy
cs.AI

Abstract

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories (misinformation, illegal activity, harassment, cybercrime, and social bias) designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework, which categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io
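To make the reported numbers concrete, the minimal sketch below shows one way the benchmark's five harm categories and a harmful-task completion rate could be represented. The class names, fields, and scoring logic are illustrative assumptions for this page only and are not taken from the SafeArena codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class HarmCategory(Enum):
    """The five harm categories named in the abstract."""
    MISINFORMATION = "misinformation"
    ILLEGAL_ACTIVITY = "illegal activity"
    HARASSMENT = "harassment"
    CYBERCRIME = "cybercrime"
    SOCIAL_BIAS = "social bias"

@dataclass
class TaskResult:
    """One benchmark task outcome (illustrative fields, not the authors' schema)."""
    harmful: bool                     # True for the 250 harmful tasks, False for the 250 safe ones
    category: Optional[HarmCategory]  # None for safe tasks
    completed: bool                   # Whether the agent finished the task

def harmful_completion_rate(results: list[TaskResult]) -> float:
    """Percentage of harmful tasks the agent completed."""
    harmful = [r for r in results if r.harmful]
    if not harmful:
        return 0.0
    return 100.0 * sum(r.completed for r in harmful) / len(harmful)

# Example: completing 87 of 250 harmful tasks gives ~34.8%, roughly the
# 34.7% reported for GPT-4o in the abstract.
demo = [TaskResult(True, HarmCategory.MISINFORMATION, i < 87) for i in range(250)]
print(f"{harmful_completion_rate(demo):.1f}%")
```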
