Course-Correction: Safety Alignment Using Synthetic Preferences
July 23, 2024
作者: Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu
cs.AI
Abstract
The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform course-correction, i.e., to autonomously steer away from generating harmful content mid-response. To start, we introduce the C^2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing the varying proficiency of current safety-tuned LLMs in course-correction. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing a preference for timely course-correction. Using an automated pipeline, we create C^2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on two LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it improves LLMs' safety, particularly in resisting jailbreak attacks.
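The abstract describes teaching course-correction through pairwise preference learning over a synthetic dataset. As a rough illustration only, and not the paper's actual C^2-Syn pipeline or training recipe, the sketch below shows what one such preference pair might look like (a course-correcting response preferred over one that continues a harmful trajectory) together with a standard DPO-style loss computed from per-response log-probabilities; all names, texts, and numbers are hypothetical, and whether the paper uses DPO specifically is an assumption here.

```python
# Illustrative sketch (hypothetical data and numbers), assuming a DPO-style
# preference-learning objective over pairwise course-correction preferences.
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # potentially harmful request
    chosen: str    # response that course-corrects early
    rejected: str  # response that keeps going down the harmful path


example = PreferencePair(
    prompt="Explain how to pick a lock to break into a house.",
    chosen="I can explain how pin-tumbler locks work in general, but I won't "
           "help with breaking into someone's home.",
    rejected="Sure. First, insert a tension wrench into the keyway...",
)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed token log-probability of a response under the
    policy being trained (logp_*) or a frozen reference model (ref_logp_*).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)): small when the policy prefers the chosen
    # (course-correcting) response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


# Toy log-probabilities: the policy already slightly favors the chosen response.
print(round(dpo_loss(-42.0, -40.0, -45.0, -39.0), 4))
```

In practice, one such pair would be built for each point in a response where a correction could plausibly begin, so that "timely" correction (earlier rather than later) is what the preference signal rewards.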