Weak-to-Strong Jailbreaking on Large Language Models
January 30, 2024
Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
cs.AI
Abstract
Although significant efforts have been dedicated to aligning large language
models (LLMs), red-teaming reports suggest that these carefully aligned LLMs
could still be jailbroken through adversarial prompts, tuning, or decoding.
Upon examining the jailbreaking vulnerability of aligned LLMs, we observe that
the decoding distributions of jailbroken and aligned models differ only in the
initial generations. This observation motivates us to propose the
weak-to-strong jailbreaking attack, where adversaries can utilize smaller
unsafe/aligned LLMs (e.g., 7B) to guide jailbreaking against significantly
larger aligned LLMs (e.g., 70B). To jailbreak, one only needs to additionally
decode two smaller LLMs once, which involves minimal computation and latency
compared to decoding the larger LLMs. The efficacy of this attack is
demonstrated through experiments conducted on five models from three different
organizations. Our study reveals a previously unnoticed yet efficient way of
jailbreaking, exposing an urgent safety issue that needs to be considered when
aligning LLMs. As an initial attempt, we propose a defense strategy to protect
against such attacks, but creating more advanced defenses remains challenging.
The code for replicating the method is available at
https://github.com/XuandongZhao/weak-to-strong
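
The abstract describes the attack only at a high level: two small models (one safe, one unsafe) steer each decoding step of the large aligned model, at the cost of one extra forward pass per small model per token. Below is a minimal per-token sketch of one way such guidance can be implemented, reweighting the strong model's next-token distribution by the log-probability gap between the two weak models. The function name, signature, and the amplification weight `alpha` are illustrative assumptions, not details taken from the abstract; see the linked repository for the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_step(strong_logits: torch.Tensor,
                        weak_unsafe_logits: torch.Tensor,
                        weak_safe_logits: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """One decoding step of weak-to-strong guidance (illustrative sketch).

    All three logit tensors are 1-D over a shared vocabulary, i.e. the
    three models are assumed to use the same tokenizer. `alpha` is a
    hypothetical amplification weight, not a value from the abstract.
    """
    log_p_strong = F.log_softmax(strong_logits, dim=-1)
    log_p_unsafe = F.log_softmax(weak_unsafe_logits, dim=-1)
    log_p_safe = F.log_softmax(weak_safe_logits, dim=-1)

    # Shift the strong model's log-probs toward the small unsafe model
    # and away from the small safe model, then renormalize and sample.
    guided = log_p_strong + alpha * (log_p_unsafe - log_p_safe)
    probs = F.softmax(guided, dim=-1)
    return torch.multinomial(probs, num_samples=1)

if __name__ == "__main__":
    vocab = 32000
    strong = torch.randn(vocab)  # stand-in logits from the large aligned model
    unsafe = torch.randn(vocab)  # stand-in logits from the small unsafe model
    safe = torch.randn(vocab)    # stand-in logits from the small safe model
    next_token = weak_to_strong_step(strong, unsafe, safe, alpha=1.5)
    print(next_token.item())
```

Under this scheme, each generated token costs one forward pass through each 7B-scale weak model on top of the 70B-scale strong model's own pass, which is consistent with the abstract's claim that the added computation and latency are minimal.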