Weak-to-Strong Jailbreaking on Large Language Models
January 30, 2024
Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
cs.AI
Abstract
Although significant efforts have been dedicated to aligning large language
models (LLMs), red-teaming reports suggest that these carefully aligned LLMs
could still be jailbroken through adversarial prompts, tuning, or decoding.
Upon examining the jailbreaking vulnerability of aligned LLMs, we observe that
the decoding distributions of jailbroken and aligned models differ only in the
initial generations. This observation motivates us to propose the
weak-to-strong jailbreaking attack, where adversaries can utilize smaller
unsafe/aligned LLMs (e.g., 7B) to guide jailbreaking against significantly
larger aligned LLMs (e.g., 70B). To jailbreak, one only needs to additionally
decode two smaller LLMs once, which involves minimal computation and latency
compared to decoding the larger LLMs. The efficacy of this attack is
demonstrated through experiments conducted on five models from three different
organizations. Our study reveals a previously unnoticed yet efficient way of
jailbreaking, exposing an urgent safety issue that needs to be considered when
aligning LLMs. As an initial attempt, we propose a defense strategy to protect
against such attacks, but creating more advanced defenses remains challenging.
The code for replicating the method is available at
https://github.com/XuandongZhao/weak-to-strong
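
The abstract describes the attack only at a high level: two small models (one safe, one unsafe) steer each decoding step of the large aligned model, at the cost of one extra forward pass per small model per token. Below is a minimal per-token sketch of one way such guidance can be implemented, reweighting the strong model's next-token distribution by the log-probability gap between the two weak models. The function name, signature, and the amplification weight `alpha` are illustrative assumptions, not details taken from the abstract; see the linked repository for the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_step(strong_logits: torch.Tensor,
                        weak_unsafe_logits: torch.Tensor,
                        weak_safe_logits: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """One decoding step of weak-to-strong guidance (illustrative sketch).

    All three logit tensors are 1-D over a shared vocabulary, i.e. the
    three models are assumed to use the same tokenizer. `alpha` is a
    hypothetical amplification weight, not a value from the abstract.
    """
    log_p_strong = F.log_softmax(strong_logits, dim=-1)
    log_p_unsafe = F.log_softmax(weak_unsafe_logits, dim=-1)
    log_p_safe = F.log_softmax(weak_safe_logits, dim=-1)

    # Shift the strong model's log-probs toward the small unsafe model
    # and away from the small safe model, then renormalize and sample.
    guided = log_p_strong + alpha * (log_p_unsafe - log_p_safe)
    probs = F.softmax(guided, dim=-1)
    return torch.multinomial(probs, num_samples=1)

if __name__ == "__main__":
    vocab = 32000
    strong = torch.randn(vocab)  # stand-in logits from the large aligned model
    unsafe = torch.randn(vocab)  # stand-in logits from the small unsafe model
    safe = torch.randn(vocab)    # stand-in logits from the small safe model
    next_token = weak_to_strong_step(strong, unsafe, safe, alpha=1.5)
    print(next_token.item())
```

Under this scheme, each generated token costs one forward pass through each 7B-scale weak model on top of the 70B-scale strong model's own pass, which is consistent with the abstract's claim that the added computation and latency are minimal.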