Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

February 6, 2025
Authors: Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi
cs.AI

Abstract

Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative--two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.
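The abstract describes HarmScore as a measure of how effectively a response enables harmful actions, grounded in two attributes: actionability and informativeness. As a rough illustration only (the paper's exact scoring procedure and weighting are not given here), the sketch below assumes each attribute is judged on a [0, 1] scale per response and that HarmScore is their simple average; it also shows how an "average absolute increase" across benchmarks, as reported in the abstract, could be aggregated. All function names and numbers are hypothetical, not taken from the paper.

```python
from statistics import mean

def harm_score(actionability: float, informativeness: float) -> float:
    """Hypothetical per-response HarmScore: a simple average of the two
    attributes the paper identifies (actionability and informativeness),
    each assumed to lie in [0, 1]. The paper's actual scoring procedure
    may differ."""
    assert 0.0 <= actionability <= 1.0 and 0.0 <= informativeness <= 1.0
    return (actionability + informativeness) / 2

def average_absolute_increase(baseline: list[float],
                              with_speak_easy: list[float]) -> float:
    """Average absolute increase of a metric (e.g., ASR or HarmScore)
    across benchmarks, mirroring how the abstract reports averages
    over four safety benchmarks."""
    return mean(b - a for a, b in zip(baseline, with_speak_easy))

# Toy usage with made-up numbers (not results from the paper):
print(harm_score(0.8, 0.9))                                # ~0.85
print(average_absolute_increase([0.1, 0.2], [0.5, 0.6]))   # ~0.40
```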
