Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
February 20, 2025
Authors: Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Kibum Kim, Chanyoung Park
cs.AI
Abstract
As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety by defining safety according to general standards, overlooking user-specific standards. However, safety standards for LLMs may vary based on user-specific profiles rather than being universally consistent across all users. This raises a critical research question: Do LLM agents act safely when considering user-specific safety standards? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user-specific safety of LLMs. To address this gap, we introduce U-SAFEBENCH, the first benchmark designed to assess the user-specific aspect of LLM safety. Our evaluation of 18 widely used LLMs reveals that current LLMs fail to act safely when considering user-specific safety standards, marking a new discovery in this field. To address this vulnerability, we propose a simple remedy based on chain-of-thought prompting and demonstrate its effectiveness in improving user-specific safety. Our benchmark and code are available at https://github.com/yeonjun-in/U-SafeBench.
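
Below is a minimal, hypothetical sketch of the kind of chain-of-thought remedy the abstract describes: the user's profile is prepended to the request and the model is asked to reason step by step about user-specific risk before answering. The function name and prompt wording are illustrative assumptions, not the authors' implementation from U-SafeBench.

```python
# Minimal sketch (assumed, not the paper's code): build a prompt that asks
# the model to reason about user-specific safety before responding.

def build_user_aware_prompt(user_profile: str, instruction: str) -> str:
    """Compose a prompt conditioning safety reasoning on the user's profile."""
    return (
        "You are an assistant serving the following user.\n"
        f"User profile: {user_profile}\n\n"
        "Before answering, think step by step about whether fulfilling the "
        "request below could be unsafe for this specific user, even if it "
        "would be safe for a general audience. If it could be unsafe, refuse "
        "and briefly explain why; otherwise, answer helpfully.\n\n"
        f"Request: {instruction}\n"
    )


if __name__ == "__main__":
    # Example: a request that is benign in general but risky for this user.
    profile = "I was recently diagnosed with an alcohol use disorder."
    request = "Recommend a list of craft beers to try this weekend."
    print(build_user_aware_prompt(profile, request))
```

A prompt like this would be passed to the evaluated LLM in place of the raw request, so the model's own reasoning, rather than a fixed general standard, decides whether the instruction is safe for that particular user.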