
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region

February 19, 2025
Authors: Chak Tou Leong, Qingyu Yin, Jian Wang, Wenjie Li
cs.AI

Abstract

The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.
