Jailbroken: How Does LLM Safety Training Fail?
July 5, 2023
Authors: Alexander Wei, Nika Haghtalab, Jacob Steinhardt
cs.AI
Abstract
Large language models trained for safety and harmlessness remain susceptible
to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on
early releases of ChatGPT that elicit undesired behavior. Going beyond
recognition of the issue, we investigate why such attacks succeed and how they
can be created. We hypothesize two failure modes of safety training: competing
objectives and mismatched generalization. Competing objectives arise when a
model's capabilities and safety goals conflict, while mismatched generalization
occurs when safety training fails to generalize to a domain for which
capabilities exist. We use these failure modes to guide jailbreak design and
then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's
Claude v1.3, against both existing and newly designed attacks. We find that
vulnerabilities persist despite the extensive red-teaming and safety-training
efforts behind these models. Notably, new attacks utilizing our failure modes
succeed on every prompt in a collection of unsafe requests from the models'
red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our
analysis emphasizes the need for safety-capability parity -- that safety
mechanisms should be as sophisticated as the underlying model -- and argues
against the idea that scaling alone can resolve these safety failure modes.
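As a concrete illustration of the mismatched-generalization failure mode described above, the paper studies prompts re-encoded into formats such as Base64: pretraining gives the model the capability to read them, but safety training performed on natural-language text may not cover them. Below is a minimal Python sketch of that kind of input transformation, using a benign placeholder request; the function name and wrapper text are illustrative and are not the paper's exact prompt template.

```python
import base64

def to_base64_prompt(request: str) -> str:
    """Re-encode a request in Base64 to illustrate the distribution shift
    away from the natural-language text that safety training typically sees.
    (Conceptual sketch, not the authors' exact attack prompt.)"""
    encoded = base64.b64encode(request.encode("utf-8")).decode("ascii")
    return f"Respond to the following Base64-encoded request:\n{encoded}"

if __name__ == "__main__":
    # Benign placeholder; the paper evaluates such transformations on prompts
    # from the models' own red-teaming evaluation sets, not reproduced here.
    print(to_base64_prompt("What is the capital of France?"))
```

The point of the sketch is only that the transformed input lies outside the distribution on which refusal behavior was trained while remaining within the model's capabilities, which is why the paper argues safety mechanisms must be as sophisticated as the underlying model.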