越獄：LLM 安全訓練為何失敗？

摘要

即使為安全性和無害性而訓練的大型語言模型仍然容易受到惡意濫用，這一點從對早期版本的ChatGPT發動"越獄"攻擊並引發不良行為的情況中可見一斑。我們不僅僅認識到了這個問題，還調查了為何這些攻擊會成功以及它們是如何被創造出來的。我們提出了安全訓練的兩種失敗模式：競爭目標和泛化不匹配。競爭目標是指當模型的能力和安全目標存在衝突時，而泛化不匹配則是指安全訓練無法應用於存在能力的領域。我們利用這些失敗模式來指導越獄設計，然後評估包括OpenAI的GPT-4和Anthropic的Claude v1.3在內的最新模型，針對現有和新設計的攻擊進行評估。我們發現，儘管這些模型背後進行了大量的紅隊測試和安全訓練，但仍存在漏洞。值得注意的是，利用我們提出的失敗模式的新攻擊在模型的紅隊評估集合中的每個提示中都能成功，並且優於現有的臨時越獄。我們的分析強調了安全能力的平行性的必要性——即安全機制應該與底層模型一樣複雜——並反對僅通過規模化就能解決這些安全失敗模式的觀點。

English

Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes.

越獄：LLM 安全訓練為何失敗？

Jailbroken: How Does LLM Safety Training Fail?

摘要

Support