Jailbreaking to Jailbreak

Abstract

Refusal training of Large Language Models (LLMs) prevents harmful outputs, yet this defense remains vulnerable to both automated and human-crafted jailbreaks. We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to the jailbroken LLMs as J_2 attackers. A J_2 attacker can systematically evaluate target models using various red teaming strategies and improve its performance via in-context learning from previous failures. Our experiments demonstrate that Sonnet 3.5 and Gemini 1.5 Pro outperform other LLMs as J_2, achieving attack success rates (ASRs) of 93.0% and 91.0%, respectively, against GPT-4o on HarmBench, with similar results against other capable LLMs. Our work not only introduces a scalable approach to strategic red teaming, drawing inspiration from human red teamers, but also highlights jailbreaking-to-jailbreak as an overlooked failure mode of the safeguard. Specifically, an LLM can bypass its own safeguards by employing a jailbroken version of itself that is willing to assist in further jailbreaking. To prevent direct misuse of J_2 while advancing research in AI safety, we publicly share our methodology but keep specific prompting details private.
