
Weak-to-Strong Jailbreaking on Large Language Models

January 30, 2024
Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
cs.AI

Abstract

Although significant efforts have been dedicated to aligning large language models (LLMs), red-teaming reports suggest that these carefully aligned LLMs can still be jailbroken through adversarial prompts, tuning, or decoding. Upon examining the jailbreaking vulnerability of aligned LLMs, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. This observation motivates the weak-to-strong jailbreaking attack, in which adversaries use smaller unsafe/aligned LLMs (e.g., 7B) to guide jailbreaking against significantly larger aligned LLMs (e.g., 70B). To jailbreak, the adversary needs only one additional decoding pass through two smaller LLMs, which incurs minimal computation and latency compared to decoding the larger LLMs. The efficacy of this attack is demonstrated through experiments on five models from three different organizations. Our study reveals a previously unnoticed yet efficient way of jailbreaking, exposing an urgent safety issue that must be considered when aligning LLMs. As an initial attempt, we propose a defense strategy against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong
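
The abstract does not spell out the exact guidance rule, but a common way to realize this kind of token-level steering is logit arithmetic: amplify the strong model's next-token distribution by the likelihood ratio between the small unsafe model and the small safe model. The sketch below illustrates that idea under this assumption; the function name `weak_to_strong_logits`, the amplification factor `alpha`, and the toy random logits are illustrative stand-ins, not the authors' exact implementation (see the linked repository for that).

```python
import torch

def weak_to_strong_logits(strong_logits: torch.Tensor,
                          weak_unsafe_logits: torch.Tensor,
                          weak_safe_logits: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Combine next-token logits from the three models.

    Computes log p_strong + alpha * (log p_unsafe - log p_safe),
    renormalized so the result is again a log-probability distribution.
    (Assumed formulation: the abstract only states that two small models
    guide the large model's decoding.)
    """
    log_p_strong = torch.log_softmax(strong_logits, dim=-1)
    log_p_unsafe = torch.log_softmax(weak_unsafe_logits, dim=-1)
    log_p_safe = torch.log_softmax(weak_safe_logits, dim=-1)
    return torch.log_softmax(
        log_p_strong + alpha * (log_p_unsafe - log_p_safe), dim=-1
    )

# Toy demo: random logits over a 5-token vocabulary stand in for real model outputs.
torch.manual_seed(0)
vocab_size = 5
strong = torch.randn(vocab_size)
unsafe = torch.randn(vocab_size)
safe = torch.randn(vocab_size)

probs = weak_to_strong_logits(strong, unsafe, safe, alpha=1.5).exp()
next_token = torch.multinomial(probs, num_samples=1)
print("steered distribution:", probs)
print("sampled token id:", next_token.item())
```

With `alpha = 0` the strong model decodes normally; larger values push generation toward the small unsafe model's behavior while the strong model still supplies most of the distribution, which is consistent with the abstract's claim that only one extra decoding pass over the two small models is needed per step.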