Weak-to-Strong Jailbreaking on Large Language Models

Abstract

Although significant effort has been dedicated to aligning large language models (LLMs), red-teaming reports suggest that even carefully aligned LLMs can still be jailbroken through adversarial prompts, tuning, or decoding. Examining the jailbreaking vulnerability of aligned LLMs, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. This observation motivates the weak-to-strong jailbreaking attack, in which adversaries use smaller unsafe/aligned LLMs (e.g., 7B) to guide jailbreaking against significantly larger aligned LLMs (e.g., 70B). The attack requires only one additional decoding pass over the two smaller LLMs, which adds minimal computation and latency compared to decoding the larger LLM. We demonstrate the efficacy of the attack through experiments on five models from three different organizations. Our study reveals a previously unnoticed yet efficient way of jailbreaking, exposing an urgent safety issue that must be considered when aligning LLMs. As an initial attempt, we propose a defense strategy against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong
