Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
July 12, 2024
Authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu
cs.AI
Abstract
This study addresses a critical gap in safety tuning practices for Large
Language Models (LLMs) by identifying and tackling a refusal position bias
within safety tuning data, which compromises the models' ability to
appropriately refuse generating unsafe content. We introduce a novel approach,
Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse
compliance to harmful prompts at any response position, significantly enhancing
their safety capabilities. DeRTa incorporates two novel components: (1) Maximum
Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models
to recognize and avoid unsafe content by appending a segment of harmful
response to the beginning of a safe response, and (2) Reinforced Transition
Optimization (RTO), which equips models with the ability to transition from
potential harm to safety refusal consistently throughout the harmful response
sequence. Our empirical evaluation, conducted using LLaMA3 and Mistral model
families across six attack scenarios, demonstrates that our method not only
improves model safety without compromising performance but also surpasses
well-known models such as GPT-4 in defending against attacks. Importantly, our
approach successfully defends against recent advanced attack methods (e.g., CodeAttack)
that have jailbroken GPT-4 and LLaMA3-70B-Instruct. Our code and data can be
found at https://github.com/RobustNLP/DeRTa.
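The two components above can be illustrated with a minimal token-level sketch. This is not the authors' implementation (see the linked repository for that); it only shows, under assumed conventions, how training examples for (1) MLE with a harmful response prefix and (2) RTO targets could be constructed. The `[REFUSE]` token and the loss-mask layout are illustrative assumptions.

```python
# Minimal sketch of DeRTa's two training signals at the token level.
# Assumptions: responses are pre-tokenized into lists of string tokens,
# and "[REFUSE]" stands in for the start of a safety refusal.

REFUSAL_TOKEN = "[REFUSE]"

def mle_with_harmful_prefix(harmful_response, safe_response, k):
    """Component (1): prepend the first k tokens of a harmful response
    to a safe response. The loss mask supervises only the safe part, so
    the harmful prefix serves as context the model learns to recover from."""
    prefix = harmful_response[:k]
    inputs = prefix + safe_response
    # 0 = ignore (harmful prefix), 1 = compute loss (safe response).
    loss_mask = [0] * len(prefix) + [1] * len(safe_response)
    return inputs, loss_mask

def rto_targets(harmful_response):
    """Component (2): at every position inside the harmful response, the
    training target is the refusal token, teaching the model it can
    transition from harm to refusal anywhere in the sequence."""
    return [REFUSAL_TOKEN for _ in harmful_response]

# Toy usage with illustrative token lists:
harmful = ["Sure,", "here", "is", "how", "to", "..."]
safe = ["I", "cannot", "help", "with", "that."]

inputs, mask = mle_with_harmful_prefix(harmful, safe, k=3)
targets = rto_targets(harmful)
```

In a real fine-tuning loop, the loss mask would typically be realized by setting ignored label positions to the framework's ignore index, and the RTO targets would be paired with the harmful sequence positions as next-token supervision.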