なぜ保護された船が座礁するのか？整列された大規模言語モデルの安全メカニズムはテンプレート領域に固定される傾向がある

要旨

大規模言語モデル（LLM）の安全性アライメントは依然として脆弱であり、比較的単純な攻撃によっても初期の動作が容易に「ジェイルブレイク」される可能性があります。既存のLLMでは、入力指示と初期モデル出力の間に固定テンプレートを埋め込むことが一般的な慣行であるため、このテンプレートが脆弱性の主要な要因であると仮定します。LLMの安全性に関する意思決定は、テンプレート領域からの集約情報に過度に依存しており、これがモデルの安全性行動に大きく影響を与えています。この問題を「テンプレート固定型安全性アライメント」と呼びます。本論文では、広範な実験を行い、テンプレート固定型安全性アライメントが様々なアライメントされたLLMに広く存在することを検証します。メカニズム分析を通じて、これが推論時のジェイルブレイク攻撃に対するモデルの脆弱性を引き起こす仕組みを明らかにします。さらに、安全性メカニズムをテンプレート領域から切り離すことが、ジェイルブレイク攻撃に対する脆弱性を軽減する上で有望であることを示します。今後の研究において、テンプレート領域への依存を減らすより堅牢な安全性アライメント技術の開発を推奨します。

English

The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.

なぜ保護された船が座礁するのか？整列された大規模言語モデルの安全メカニズムはテンプレート領域に固定される傾向がある

Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region

要旨

Support