Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions

Abstract

Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative--two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.
